Preface


Regression analysis has been one of the most widely used statistical methodologies during the
past 50 years for analyzing relationships among variables. Because of its flexibility, usefulness,
applicability, and theoretical and technical succinctness, regression analysis has become a basic
statistical tool for solving real-world problems. To apply these elegant techniques
successfully and effectively, one requires sound insight into, and understanding of, both the
underlying theory (i.e., statistical reasoning) and its practical application.

This book is designed primarily as a text for a standard course in regression analysis, and is an
outgrowth of class notes for advanced undergraduates, graduate students and researchers in
various fields of engineering, the chemical and physical sciences, mathematical sciences and
statistics. It therefore blends theory and application so that the reader will gain a deep
enough understanding of the basic principles to apply regression model building
techniques in a wide variety of environments. It contains both conventional topics and recent
practical developments.

This book is also intended to fill what we perceive to be a communication gap
faced by students or researchers who have a limited mathematical background but would like
to continue from where most beginning statistics texts end, or who want to build up their
knowledge of advanced statistical data analysis and modeling.

The book consists of nine chapters. The first seven chapters are devoted to a fairly 
comprehensive description of linear regression modeling, methods of analysis and their 
ramifications. Chapters 8 and 9 are the concluding chapters which present the decision-making 
aspects of this book. 

In Chapter 1 we give a general introduction to the scope of regression analysis, sketch its
historical background and describe some typical applications of regression.

Chapter 2 provides a review of some basic results in probability theory and basic concepts 
in statistics, which are essential to understand the general framework of the statistical method 
and to develop the further theory and its applications. 

In Chapter 3 we introduce the method of least squares for fitting straight lines. Since simple 
linear regression is the gateway to the analysis of the general linear model, we provide detailed 
theoretical aspects and a broad applied viewpoint as well. 

Chapter 4 provides a rigorous outline of indispensable results in linear algebra, including
matrix theory, which will facilitate readers' understanding of later chapters. The matrix
notation used throughout the book is introduced here to improve readers' confidence and
understanding.

Chapter 5 discusses the study of multiple regression analysis along with a comprehensive 
approach to inference procedures. The material in Chapter 5 constitutes the core of the text. 

In Chapter 6, we are concerned with the analysis of residuals and the detection of violations of
model assumptions. Modern diagnostic methods are also discussed for outlier detection,
plotting of residuals and the development of transformation techniques.


Chapter 7 then considers more complicated models in linear regression and their
applications. These include polynomial models, radial basis functions, the use of dummy 
variables, logistic regression for which the response is qualitative/binary, and a basic treatment 
of the generalized linear model. 

Chapter 8 returns to the topic of evaluation criteria for subset selection and discusses 
various types of selection procedures. 

Finally, Chapter 9 is devoted to the diagnosis and correction of multicollinearity, which is a
common and serious problem in regression analysis.

For classroom purposes, it is suggested that two semesters be taken for complete coverage.
However, it is possible to treat most of the topics in one semester if some of the prerequisite
knowledge is assumed or if the treatment is made less rigorous. Exercises are
provided at the end of Chapters 2 to 9. We have tried to balance theoretical aspects and
applications to data analysis. The data are mostly based on real-world situations.

In addition, even though this book is not intended as a manual for any statistical computer
package, readers can use computer packages to apply the techniques learned here. However, it
is strongly recommended that readers use high-quality software only after they thoroughly
understand the statistical reasoning behind the methods being used.

The authors wish to give warmest thanks to Professor Carlos Brebbia who gave us the 
opportunity to get this material into print and to Mr. Brian Privett, head of production of WIT 
who has been waiting patiently for the manuscript's completion. We would like to thank 
Professor C.S. Chen for introducing the authors. We would also like to thank a number of 
friends, neighbors and colleagues in the Department of Mathematical Sciences at the 
University of Nevada, Las Vegas. Finally, we are grateful to our wives, Joyce and Sookhyun 
Ellen, who encouraged us to complete this project under difficult circumstances. 


Michael A. Golberg 
Hokwon A. Cho 


To our wives 
Joyce 
Sookhyun 


To our children 
Jonathan 
Stefany 
Katherine Soojin 
Chapter 1


Introduction 


Regression analysis is a collection of statistical techniques that serve as a basis for drawing
inferences about relationships among interrelated variables. Since these techniques
are applicable in almost every field of study, including the social, physical and biological
sciences, business and engineering, regression analysis is now perhaps the most used of
all data analysis methods. Hence, the goal of this text is to develop the basic theory of
this important statistical method and to illustrate the theory with a variety of examples
chosen from economics, demography, engineering and biology. To make the text relatively
self-contained we have included basic material from statistics, linear algebra and
numerical analysis. In addition, in contrast to other books on this topic [27, 87], we have
attempted to provide details of the theory rather than just presenting computational and
interpretive aspects.


1.1 A Brief History of Regression 


1.1.1 Genealogy of Regression 


The British anthropologist Sir Francis Galton (1822-1911) appears to have been the
first to introduce the word “regression,” in his study of heredity. He found that, on
average, the heights of children do not tend toward their parents’ heights, but rather fall
closer to the average than the parents’ heights do. Galton termed this “regression to mediocrity
in hereditary stature.” In the Journal of the Anthropological Institute, Vol. 15 (1885),
pp. 246-263, he wrote that “... The experiments showed further that the mean filial
regression towards mediocrity was directly proportional to the parental deviation from
it.” Galton then described how to determine the relationship between children’s heights
and parents’ heights. Today Galton’s analysis would be called a “correlation analysis,”
a term for which he is also responsible. In most model-fitting situations today, there are
no elements of “regression” in the original sense. Nevertheless, the word is so established
that we continue to use it. For more on the history of regression, we
refer readers to the Statistical Encyclopedia or The History of Statistics by Stigler (1986)
[110].




1.1.2 The Method of Least Squares 


As we shall see, the basic mathematical tool in regression analysis is the method of least
squares. There has been controversy over who first discovered the method. Credit is
generally given to Carl Friedrich Gauss (1777-1855), although Adrien-Marie Legendre
(1752-1833) apparently worked on its use independently in 1805. In brief, the idea of the
least squares method is to find the values of the unknown constants in a hypothesized
equation that minimize the sum of the squared deviations of the observed values from those
predicted by the model. The justification for this is given by the celebrated Gauss-Markov
theorem, which is proved in Chapters 3 and 5.
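
To make the least squares idea concrete, here is a minimal Python sketch that fits a straight line y = b0 + b1 x by minimizing the sum of squared deviations. The data values and the names `fit_line` and `sum_sq` are illustrative assumptions, not taken from the text; the closed-form expressions are the standard least squares solution for a straight line.

```python
def fit_line(xs, ys):
    """Return (b0, b1) minimizing sum((y - b0 - b1*x)**2) over the data."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx            # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1


def sum_sq(xs, ys, b0, b1):
    """Sum of squared deviations of observed from predicted values."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in zip(xs, ys))


# Hypothetical data, roughly following y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

b0, b1 = fit_line(xs, ys)
best = sum_sq(xs, ys, b0, b1)
print(b0, b1, best)
```

Perturbing either coefficient away from the fitted values can only increase `sum_sq`, which is the defining property of the least squares estimates.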

The point to emphasize, however, is that regression analysis and the method of
least squares have always been linked to practical use.


1.2 Typical Applications of Regression Analysis


Regression analysis is more than fitting equations to data: for every problem, in whatever
field of science, one needs to clarify the goal of the problem and
the methods needed for the regression analysis. The following summary provides
some guidelines.


1.2.1 Use of Regression Analysis 


The main purposes of regression analysis can be summarized as: 
1. Data description - to investigate or refute a relationship among variables. 


2. Interpretation - to give a summary or an interpretation through the fitted model
to obtain an interpolation or calibration curve/surface.


3. Inference - to develop or improve the theoretical model(s)/method(s) which should 
be chosen to extend and generalize it to other sets of data. These formal statistical 
techniques are called estimation of parameters, testing and prediction. 


We assume that a set of data has been obtained. For practical application of the
methodology, it is important to remember that regression analysis is a data-analysis-oriented
approach to problem solving. Hence, fitting an equation itself may not be
the primary objective of the study. Furthermore, fitting an equation may only be an
intermediate process to gain insight and understanding of the data.


1.2.2 Data Sets 


An essential part of regression analysis is the data which have been collected. Perhaps the
most serious limitation in a regression analysis is the failure to collect data on all potentially
important regressors. It is fair to say that the results are only as good as the data that
produced them. The basic methods of collecting data are:


1. Retrospective study - using historical data or an existing census, etc. 


2. Observational study - through random observations. 




3. Experimental study - through designed/planned experiments or surveys. 


Even if many of the difficulties due to data sets are rather obvious, limitations in the 
data gathering process may prevent them from being analyzed appropriately. 


1.3 Computer Usage 


During the past twenty years, advances in computer technology, particularly in personal
computers (PCs), and in telecommunications, such as the internet (World-Wide Web), have
brought revolutionary improvements in statistical research. Because of these developments,
many of the traditional limitations in the computational aspects of regression
analysis have largely disappeared. In consequence, a large number of high-quality statistical
packages have become available which have largely eliminated the drudgery of
regression calculations, so that students, practitioners and researchers can focus on the
analysis of data rather than just calculating descriptive aspects.

As a result, many new mathematical techniques have been developed for data analysis
which would not have been feasible 20 years ago. In addition, most currently popular
packages such as MINITAB, SAS and SPSS are now available for PCs with
user-friendly spreadsheet interfaces which make calculations easy to do. Of particular
importance is the ease with which high-quality graphical output can be obtained. Of course,
such tools should not be used uncritically, and it is one of the major goals of this text to
make it possible for students to use these packages with confidence that they
understand their output. Because of its widespread availability, quality and ease of use,
we have chosen MINITAB™ to do most of the calculations and plots that appear in this
book.

Moreover, the internet has made it possible to obtain a large variety of additional
material, such as data sets, lecture notes and free computing software, which are useful
adjuncts to the material presented here. Much of this can be obtained with a few simple
mouse clicks using currently available search engines. We strongly advise students to
make use of these capabilities.




Chapter 2 


Some Basic Results in 
Probability and Statistics 


2.1 Introduction 


Throughout this text we will assume that the student has had a basic course in probability 
theory and statistical inference such as that in Refs [40, 63]. Our preference is that such 
a course would be at a level requiring calculus and some background in matrix theory. 
Because some tools in matrix algebra are necessary for multiple regression, we will outline 
the necessary ones in Chapter 4. In this chapter we will present some of the standard 
results in probability and statistics that will be needed throughout the book. Those 
familiar with this material may skip this chapter and simply refer to it as necessary. 


2.2 Probability Spaces 


A probability space (also called a sample space) is a set $\Omega$ together with a collection of
subsets $A$ of $\Omega$ called events. If $A$ is an event, the probability that $A$ occurs is written
as $P(A)$ and, for short, is usually read as the “probability of the event $A$.” As is well
known, $P(A)$ has the following properties:

(1) $0 \le P(A) \le 1$;

(2) $P(\Omega) = 1$;

(3) $P(\emptyset) = 0$, where $\emptyset$ is the empty or impossible event;

(4) $P(A \cup B) = P(A) + P(B) - P(A \cap B)$, where $A \cup B$ is the event that “either $A$ or
$B$ occurs” and $A \cap B$ is the event that “both $A$ and $B$ occur.” If $A$ and $B$ cannot
occur simultaneously ($A \cap B = \emptyset$), then $P(A \cap B) = 0$ and $P(A \cup B)$ reduces to

$$P(A \cup B) = P(A) + P(B). \tag{2.1}$$

In this case we say that $A$ and $B$ are mutually exclusive events.
Similarly, more complicated rules exist for calculating $P(A_1 \cup A_2 \cup \cdots \cup A_n)$, which
may be found in [40, 63]. For example, if $A_i$, $i = 1, 2, \ldots, n$, are mutually exclusive,
i.e.,

$$A_i \cap A_j = \emptyset, \quad i \ne j, \tag{2.2}$$

then

$$P(A_1 \cup A_2 \cup \cdots \cup A_n) = \sum_{j=1}^{n} P(A_j). \tag{2.3}$$


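(5) The event $\bar{A}$ denotes the complement of $A$ and is equivalent to saying that “$A$ does
not occur.” Since $A$ and $\bar{A}$ are mutually exclusive and $A \cup \bar{A} = \Omega$, it follows
from (2.1) that

$$P(\bar{A}) = 1 - P(A). \tag{2.4}$$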


(6) If an event $B$ is known to have occurred, then we may regard $B$ as our new sample
space and then “the probability of $A$ given that $B$ has occurred” (then both $A$ and
$B$ have occurred) is called the conditional probability of $A$ given $B$, is denoted
by $P(A|B)$, and is given by

$$P(A|B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) \ne 0. \tag{2.5}$$


(7) Let the sample space be partitioned into $k$ mutually exclusive events $B_1, B_2, \ldots, B_k$
such that $P(B_j) > 0$, $j = 1, 2, \ldots, k$. Also let $A$ be another event such that $P(A) > 0$,
that has occurred in the sample space. Then

$$A = A \cap (B_1 \cup B_2 \cup \cdots \cup B_k) = (A \cap B_1) \cup (A \cap B_2) \cup \cdots \cup (A \cap B_k). \tag{2.6a}$$

However, $P(A \cap B_j) = P(B_j) P(A|B_j)$, $j = 1, 2, \ldots, k$, using (2.5). So

$$P(A) = P(B_1) P(A|B_1) + P(B_2) P(A|B_2) + \cdots + P(B_k) P(A|B_k) = \sum_{j=1}^{k} P(B_j) P(A|B_j). \tag{2.6b}$$

This is called the law of total probability. Using (2.5) and (2.6b) we have

$$P(B_i|A) = \frac{P(B_i) P(A|B_i)}{\sum_{j=1}^{k} P(B_j) P(A|B_j)},$$

which is known as Bayes’ theorem. The conditional probability $P(B_i|A)$ is expressed
as a function of the simple probability of $B_i$, $P(B_i)$, which in the absence
of information about $A$ is called the prior probability. On the other hand, $P(B_i|A)$
is called the posterior probability. In particular, for $k = 2$, this reduces to

$$P(B_1|A) = \frac{P(A|B_1) P(B_1)}{P(A|B_1) P(B_1) + P(A|B_2) P(B_2)}.$$

As we see, this describes how to revise the prior probability of the event $B_1$ in the
light of additional information to yield the posterior probability.
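
As an illustration of Bayes' theorem with k = 2, the following Python sketch revises a prior in the light of an observed event. The numbers (a prior of 1/100 for B1, and conditional probabilities 95/100 and 5/100 for A) and the function name `posterior` are hypothetical, chosen only to make the arithmetic visible.

```python
from fractions import Fraction


def posterior(priors, likelihoods):
    """Bayes' theorem: priors[j] = P(B_j), likelihoods[j] = P(A | B_j).

    Returns the list of posterior probabilities P(B_j | A).
    """
    # Denominator: the law of total probability, P(A) = sum_j P(B_j) P(A | B_j).
    p_a = sum(p * l for p, l in zip(priors, likelihoods))
    return [p * l / p_a for p, l in zip(priors, likelihoods)]


# Hypothetical numbers: B_1 is rare, and A is much more likely under B_1.
priors = [Fraction(1, 100), Fraction(99, 100)]
likelihoods = [Fraction(95, 100), Fraction(5, 100)]

post = posterior(priors, likelihoods)
print(post[0])  # 19/118, up from the prior 1/100 but still well below 1/2
```

Exact fractions are used so that the posteriors sum to exactly 1 and no rounding obscures the calculation.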
If the occurrence of $B$ has no effect on the occurrence of $A$, then $P(A|B) = P(A)$
and (2.5) becomes

$$P(A \cap B) = P(A) P(B). \tag{2.7}$$

In this case we say that $A$ and $B$ are independent events. For more than two events
the conditions for independence are more complicated. For example, three events
$A, B, C$ are independent if and only if (iff)

$$P(A \cap B) = P(A) P(B), \quad P(A \cap C) = P(A) P(C), \quad P(B \cap C) = P(B) P(C), \tag{2.8a}$$

and

$$P(A \cap B \cap C) = P(A) P(B) P(C). \tag{2.8b}$$

It is important to note that (2.8a) does not imply (2.8b) nor does (2.8b) imply (2.8a).
The $n$ events $A_i$, $1 \le i \le n$, are independent if and only if for any subset of $k$ events
$A_{i_1}, A_{i_2}, \ldots, A_{i_k}$, $2 \le k \le n$,

$$P(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}) = P(A_{i_1}) P(A_{i_2}) \cdots P(A_{i_k}). \tag{2.9}$$

In particular,

$$P(A_1 \cap A_2 \cap \cdots \cap A_n) = P(A_1) P(A_2) \cdots P(A_n). \tag{2.10}$$

Equation (2.10) will be used repeatedly throughout the text.
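
The remark that the pairwise conditions do not imply the triple product condition can be checked on a small sample space. The sketch below uses two tosses of a fair coin and three events of our own choosing (an assumed example, not one from the text): A = "first toss is a head", B = "second toss is a head", C = "exactly one head".

```python
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))  # four equally likely outcomes


def prob(event):
    """Probability of an event given as a predicate on outcomes."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))


def first_head(w):
    return w[0] == "H"


def second_head(w):
    return w[1] == "H"


def exactly_one_head(w):
    return (w[0] == "H") != (w[1] == "H")


events = [first_head, second_head, exactly_one_head]

# Pairwise product conditions: every pair of events multiplies.
pairwise = all(
    prob(lambda w: e(w) and f(w)) == prob(e) * prob(f)
    for i, e in enumerate(events)
    for f in events[i + 1:]
)

# Triple product condition.
triple = prob(lambda w: all(e(w) for e in events)) == (
    prob(first_head) * prob(second_head) * prob(exactly_one_head)
)

print(pairwise, triple)  # True False
```

Here every pair of events satisfies the product rule, yet the three events cannot all occur together, so the triple intersection has probability 0 rather than 1/8.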


2.3 Random Variables 


Without getting into technical details, a random variable may be regarded as a real-valued
function on a probability space. We will distinguish between two types of random
variables (often abbreviated as r.v.): discrete and continuous. A discrete random variable
is one which takes at most a denumerable number of values (often, just finitely many),
while a continuous random variable may take on a continuum of values.

The set of values a random variable can take on is called its range. Typically, we 
will denote a random variable by upper case letters X,Y, Z etc. and the range of X 
will be denoted by R(X). For example, if we roll a die and X denotes the number of 
spots observed on top, then X is discrete and R(X) = {1,2,3,4,5,6}. On the other 
hand, if X denotes the value of a number chosen at random from the interval [0, 1], then 
R(X) = [0,1] and X is a continuous random variable. 

Often in practice, if a random variable is truly discrete but can take on a
very large number of values, it is customary to use a continuous random variable to
approximately model it. For example, if X denotes the
amount of money a person can earn in a given year (in dollars), then strictly speaking
R(X) = {0,1,2,...}. However, it is often useful to approximate this random variable by
one that can take on all values in [0, ∞). This is quite common in statistical analysis
and is usually done without comment where necessary. This convention will be followed
throughout the text.


2.4 The Probability Distribution of X 


If a random variable is discrete and $x_i \in R(X)$, then

$$f_X(x_i) = P\{X = x_i\}, \quad 1 \le i < \infty, \tag{2.11}$$

denotes the probability that $X$ takes on the value $x_i$. The sequence of values
$\{f_X(x_1), f_X(x_2), \ldots, f_X(x_n), \ldots\}$ is called the distribution of $X$ and is usually abbreviated as

$$f_X(x), \quad x \in R(X). \tag{2.12}$$

Often, $f_X$ is referred to as the probability mass function of $X$.


Example 2.1 Consider an experiment of tossing a fair coin twice. Let $X$ be the
number of heads observed. The sample space is $S = \{HH, HT, TH, TT\}$, so $X$ can be
either 0, 1 or 2. Assuming the tosses are independent, we have the probability mass
function of $X$ as follows:


    x          |  0     1     2
    P(X = x)   | 1/4   1/2   1/4
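
The table in Example 2.1 can be reproduced by listing the sample space and counting heads. This is only an illustrative Python sketch; the variable names are ours.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of two tosses: HH, HT, TH, TT, each with probability 1/4.
sample_space = list(product("HT", repeat=2))

# X = number of heads; count how many outcomes give each value of X.
counts = Counter(outcome.count("H") for outcome in sample_space)

pmf = {x: Fraction(c, len(sample_space)) for x, c in sorted(counts.items())}
print(pmf)  # {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}
```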


More generally, the probability that $X$ takes on a value between $a$ and $b$ is denoted
by

$$P\{a \le X \le b\} \tag{2.13}$$

and is given by

$$P\{a \le X \le b\} = \sum_{a \le x \le b} f_X(x), \tag{2.14}$$

where $f_X(x) = 0$ if $x \notin R(X)$, so the sum in (2.14) has at most countably many nonzero terms.
When $X$ is a continuous random variable it will usually be the case (at least in this
book) that

$$P\{a \le X \le b\} = \int_a^b f_X(x)\,dx, \tag{2.15}$$

where $f_X(x)$ in this case is called the probability density function (pdf) of $X$.
In particular,

$$P\{X \le x\} = \int_{-\infty}^{x} f_X(t)\,dt = F_X(x) \tag{2.16}$$

is called the cumulative distribution function (cdf) of $X$ (distribution function for short).
By the fundamental theorem of calculus,

$$f_X(x) = \frac{d}{dx} F_X(x). \tag{2.17}$$


2.5 Some Random Variables and their Distributions 


For future reference we give a number of examples of commonly occurring random variables
and their distributions. Further examples will be given in Section 2.8.


Example 2.2 (Bernoulli random variables) Many experiments can be described by
observing one of two possible outcomes. For example, a coin toss can result in “heads”
or “tails”, a poll question is answered “yes” or “no”, a medical treatment can be a
“success” or “failure”. In this case we can take our sample space $\Omega = \{S, F\}$ and
define a random variable $X$ which counts the number of “successes.” Then $X(S) = 1$
and $X(F) = 0$. If the probability of success is $p$, then $P\{X = 1\} = f_X(1) = p$ and
$P\{X = 0\} = f_X(0) = 1 - p = q$. Then $R(X) = \{0, 1\}$ and

$$f_X(x) = p^x q^{1-x}, \quad x = 0, 1, \tag{2.18}$$

is the distribution of $X$. A random variable having the distribution in (2.18) is called a
Bernoulli random variable.


Example 2.3 (Binomial random variables) If the experiment in Example 2.2 is repeated
independently $n$ times, then the sample space can be described by the Cartesian
product $\Omega^n = \{S, F\}^n$ of $\Omega$ taken $n$ times. $\Omega^n$ is the set of all ordered $n$-tuples of $S$'s
and $F$'s and has $2^n$ elements. If $(x_1, x_2, \ldots, x_n)$, $x_i = S$ or $F$, $1 \le i \le n$, is an element
of $\Omega^n$, then by independence $P\{(x_1, x_2, \ldots, x_n)\} = p^k q^{n-k}$, where $k$ is the number of
successes in the sequence $(x_1, x_2, \ldots, x_n)$. Now let $X : \Omega^n \to \mathbb{R}$ be the random variable
which counts the number of successes in $(x_1, x_2, \ldots, x_n)$; then $R(X) = \{0, 1, 2, \ldots, n\}$ and
standard counting arguments show that the distribution of $X$ is given by

$$f_X(x) = \binom{n}{x} p^x q^{n-x}, \quad x = 0, 1, \ldots, n, \tag{2.19}$$

where

$$\binom{n}{x} = \frac{n!}{x!\,(n-x)!} \tag{2.20}$$

is called the binomial coefficient. (For $n$ a positive integer, $n! = n(n-1) \cdots 2 \cdot 1$ and
$0! = 1$.)

The distribution (2.19) is usually called the binomial distribution and $X$ is called a
binomial random variable.
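
The binomial distribution, the binomial coefficient n!/(x!(n-x)!) times p^x q^(n-x), is easy to evaluate directly. The short Python sketch below (the function name `binomial_pmf` and the parameter values are our own choices) checks that the probabilities sum to 1 and that n = 2, p = 1/2 recovers the two-coin table of Example 2.1.

```python
from math import comb


def binomial_pmf(x, n, p):
    """P{X = x} for a binomial random variable with parameters n and p."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Illustrative parameters: n = 10 trials with success probability p = 0.3.
n, p = 10, 0.3
pmf = [binomial_pmf(x, n, p) for x in range(n + 1)]
total = sum(pmf)  # should equal 1 up to rounding error

# n = 2, p = 1/2 is the two-coin experiment of Example 2.1.
two_coins = [binomial_pmf(x, 2, 0.5) for x in range(3)]
print(total, two_coins)
```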


Example 2.4 (Poisson random variables) As we mentioned previously, even when a
random variable has a finite range, it is often convenient mathematically to approximate
it by a random variable with an infinite range. An important example of this occurs
when $X$ is a binomial random variable, $p$ is small and $n$ is large. In this case, if we
let $\lambda = np$, then (2.19) can be approximated by

$$\frac{n!}{x!\,(n-x)!}\, p^x q^{n-x} \approx \frac{e^{-\lambda} \lambda^x}{x!}. \tag{2.21}$$

If we now let $X$ take on values $\{0, 1, 2, \ldots\} = \Omega$ and define $X : \Omega \to \mathbb{R}$ by $X(x) = x$
and

$$f_X(x) = P\{X = x\} = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x \ge 0, \tag{2.22}$$

then $f_X(x) \ge 0$ and

$$\sum_{x=0}^{\infty} \frac{e^{-\lambda} \lambda^x}{x!} = e^{-\lambda} e^{\lambda} = 1, \tag{2.23}$$

so that $f_X(x)$ is a probability distribution on $\Omega$. This distribution is usually called a
Poisson distribution and the corresponding random variable a Poisson random variable.
The Poisson distribution is often called the distribution of rare events and is often used
in statistical analyses to describe the occurrence of events occurring randomly in time.
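
The quality of the Poisson approximation to the binomial distribution for small p and large n can be examined numerically. In the Python sketch below the values n = 1000 and p = 0.002 (so that np = 2) are assumptions chosen for illustration, as are the function names.

```python
from math import comb, exp, factorial


def binomial_pmf(x, n, p):
    """Binomial probability of x successes in n trials."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)


def poisson_pmf(x, lam):
    """Poisson probability exp(-lam) * lam**x / x!."""
    return exp(-lam) * lam**x / factorial(x)


n, p = 1000, 0.002   # large n, small p
lam = n * p          # Poisson parameter np

# Largest pointwise gap between the two distributions for x = 0, ..., 20.
max_gap = max(
    abs(binomial_pmf(x, n, p) - poisson_pmf(x, lam)) for x in range(21)
)
print(max_gap)
```

For these values the two probability mass functions agree closely at every x; repeating the comparison with a larger p and the same n shows the approximation degrading.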
Example 2.5 (Uniform random variables) To model an experiment where there are
$n$ equally probable numerical outcomes we let $\Omega = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ is the $i$-th
outcome, $1 \le i \le n$. Let $X : \Omega \to \mathbb{R}$ be defined by $X(x_i) = x_i$, the
value of the $i$-th outcome; then

$$f_X(x_i) = P\{X = x_i\} = \frac{1}{n}, \quad 1 \le i \le n, \tag{2.24}$$

and $X$ is called a uniform random variable and the corresponding distribution a uniform
distribution.

If there are an infinite number of equiprobable outcomes, then one cannot use a
discrete random variable to model such situations. In this situation a reasonable probabilistic
model is to assume that the outcomes can occur in a finite interval of real numbers
$[a, b]$, $-\infty < a < b < \infty$. In this case we take $\Omega = [a, b]$ and define $X : \Omega \to \mathbb{R}$ by $X(x) = x$, to
describe the outcome of a point $x$ chosen at random from $[a, b]$. Of course $X$ is not a discrete
random variable, so we cannot use $f_X(x) = P(X = x)$ to describe its probabilistic
properties.

Here, we need to modify our approach to assigning probability measures: we cannot
begin by assigning probabilities to points; rather, we must begin by assigning measures to
intervals. To model the notion of an equiprobable choice of “points”, intervals of equal
length should be equally likely to occur. Thus, if $A = [c, d] \subset [a, b]$ we define

$$P(A) = \frac{d - c}{b - a}. \tag{2.25}$$

Thus, $P(A) \ge 0$ and $P(\Omega) = (b - a)/(b - a) = 1$. The measure defined by (2.25) is
called the uniform measure on $[a, b]$ and then

$$P(c \le X \le d) = P([c, d]) = \frac{d - c}{b - a}. \tag{2.26}$$

From (2.26) it follows that the distribution function of $X$ is given by

$$F_X(x) = \begin{cases} 0, & x < a \\ \dfrac{x - a}{b - a}, & a \le x \le b \\ 1, & x > b. \end{cases} \tag{2.27}$$

Differentiation of (2.27) gives

$$f_X(x) = \begin{cases} 0, & x < a \\ \dfrac{1}{b - a}, & a \le x \le b \\ 0, & x > b. \end{cases} \tag{2.28}$$

$f_X(x)$ is called a uniform density.
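
The relationship between the uniform distribution function and the uniform density can be checked numerically. In this Python sketch the interval endpoints a = 2, b = 6 and the subinterval [3, 4.5] are hypothetical; the cdf is taken to be (x - a)/(b - a) on [a, b] and the density 1/(b - a) there.

```python
def uniform_cdf(x, a, b):
    """Distribution function of a uniform random variable on [a, b]."""
    if x < a:
        return 0.0
    if x > b:
        return 1.0
    return (x - a) / (b - a)


def uniform_pdf(x, a, b):
    """Uniform density: constant 1/(b - a) on [a, b], zero elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0


a, b = 2.0, 6.0
c, d = 3.0, 4.5

# P(c <= X <= d) as a difference of cdf values; equals (d - c)/(b - a).
interval_prob = uniform_cdf(d, a, b) - uniform_cdf(c, a, b)

# A central difference quotient of the cdf approximates the density.
h = 1e-6
slope = (uniform_cdf(4.0 + h, a, b) - uniform_cdf(4.0 - h, a, b)) / (2 * h)
print(interval_prob, slope, uniform_pdf(4.0, a, b))
```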


Example 2.6 (Canonical random variables) It is important to observe from Example
2.5 that the distribution of a random variable $X$ is inherited from the measure that
is imposed on the underlying probability space. Hence, a continuous function
$F(x) : \mathbb{R} \to \mathbb{R}$ having the properties:

(i) $0 \le F(x) \le 1$;
(ii) $F(x)$ is a nondecreasing function, i.e., $F(x) \le F(y)$ for $x \le y$;

can be used to define a probability measure on $\mathbb{R}$ by defining

$$P(a < X \le b) = F(b) - F(a). \tag{2.29}$$

If we define the random variable $X : \mathbb{R} \to \mathbb{R}$ by $X(x) = x$, then it follows from (2.29)
that the cdf of $X$ is given by

$$F_X(x) = P(X \le x) = F(x). \tag{2.30}$$

Thus, given a continuous function satisfying (2.30) we can always define a random variable
$X$ having the cdf $F(x)$. This random variable is called the canonical random variable
associated with a given cdf $F(x)$. This justifies the common practice in probability
theory and statistics of referring to random variables and distributions interchangeably.
We shall follow this custom throughout the text.

If $F(x)$ is differentiable, then

$$f(x) = \frac{dF(x)}{dx} \tag{2.31}$$

is called the density of $F(x)$. Again, by the fundamental theorem of calculus,

$$F(x) = \int_{-\infty}^{x} f(t)\,dt. \tag{2.32}$$

Thus a random variable may be expressed in terms of its density, since the density of the
cdf of a canonical random variable $X$ satisfies

$$f_X(x) = f(x). \tag{2.33}$$

Hence, we may refer to a continuous random variable by its density and we shall do so
without further comment.


Example 2.7 (Logistic random variables) Let

$$f(x) = \frac{e^x}{(1 + e^x)^2}, \quad -\infty < x < \infty, \tag{2.34}$$

then it is easily verified by making the change of variable $u = e^x$ that

$$\int_{-\infty}^{\infty} f(x)\,dx = 1. \tag{2.35}$$

Since $f(x) > 0$, $f(x)$ is a density function with cdf

$$F(x) = \frac{e^x}{1 + e^x}, \quad -\infty < x < \infty. \tag{2.36}$$

The density function given by (2.34) is called a logistic density and a random variable
having a logistic density is called a logistic random variable.
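
A numerical check of the logistic example is sketched below in Python. It assumes the logistic density has the form e^x/(1 + e^x)^2, which is the derivative of the cdf e^x/(1 + e^x) in (2.36); the midpoint-rule helper `integrate` and the truncation of the real line to [-40, 40] are our own simplifications.

```python
from math import exp


def logistic_pdf(x):
    """Logistic density, assumed to be exp(x) / (1 + exp(x))**2."""
    return exp(x) / (1.0 + exp(x)) ** 2


def logistic_cdf(x):
    """Logistic distribution function exp(x) / (1 + exp(x))."""
    return exp(x) / (1.0 + exp(x))


def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation to the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


# The tails beyond |x| = 40 contribute a negligible amount.
total = integrate(logistic_pdf, -40.0, 40.0)    # approximately 1
partial = integrate(logistic_pdf, -40.0, 1.0)   # approximately F(1)
print(total, partial, logistic_cdf(1.0))
```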
It is often convenient in probability theory and statistics to be able to classify random
variables whose densities are of seemingly different forms into families which have a common
structure. An important family in modern statistics, which contains the binomial,
Poisson and normal random variables, is the family of exponential densities, which have the form

$$f(x; \theta) = \exp\left[a(x)\, b(\theta) + c(\theta) + d(x)\right], \tag{2.37}$$

where $x$ can take on a discrete or continuous range of values independent of $\theta$. This
family gives rise to the generalized linear models, a class of statistical models which contains, as a
particular case, the normal models, which will be the primary focus of this text.


Example 2.8 (Exponential family of random variables) As particular cases of (2.37)
we show that Poisson and binomial random variables are members of the exponential
family. In Section 2.8 we show that normal random variables are as well.

If $X$ is a Poisson random variable, then its density is given by $f(x) = \exp(-\lambda) \lambda^x / x!$.
Now $\lambda^x = \exp(x \log \lambda)$, so that $\exp(-\lambda) \lambda^x / x! = \exp(x \log \lambda - \lambda - \log x!)$. Thus, if
$\theta = \lambda$, $a(x) = x$, $b(\theta) = \log \theta$, $c(\theta) = -\theta$, $d(x) = -\log x!$, then $f(x; \theta)$ is of the form in
(2.37), so a Poisson density belongs to the exponential family.

Similarly, for a binomial density one has

$$f(x; p) = \binom{n}{x} p^x (1 - p)^{n-x} = \exp\left[x \log p - x \log(1 - p) + n \log(1 - p) + \log \binom{n}{x}\right].$$

So, with $\theta = p$, $a(x) = x$, $b(\theta) = \log[\theta/(1 - \theta)]$, $c(\theta) = n \log(1 - \theta)$ and $d(x) = \log \binom{n}{x}$,
$f(x; p)$ is of the form in (2.37) and so is a member of the exponential family.
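
The exponential family representations of the Poisson and binomial densities can be verified numerically by evaluating exp[a(x)b(θ) + c(θ) + d(x)] and comparing it with the density computed directly. The Python sketch below does this for one hypothetical value of x and of each parameter; the helper `exp_family` is our own name.

```python
from math import comb, exp, factorial, log


def exp_family(x, theta, a, b, c, d):
    """Evaluate exp[a(x) * b(theta) + c(theta) + d(x)]."""
    return exp(a(x) * b(theta) + c(theta) + d(x))


# Poisson with theta = lambda: a(x) = x, b = log, c = -theta, d = -log x!.
lam = 2.5
poisson_direct = exp(-lam) * lam**3 / factorial(3)
poisson_expfam = exp_family(
    3, lam,
    a=lambda x: x,
    b=log,
    c=lambda t: -t,
    d=lambda x: -log(factorial(x)),
)

# Binomial with theta = p: b = log(theta / (1 - theta)), c = n log(1 - theta).
n, p = 8, 0.3
binom_direct = comb(n, 3) * p**3 * (1 - p) ** (n - 3)
binom_expfam = exp_family(
    3, p,
    a=lambda x: x,
    b=lambda t: log(t / (1 - t)),
    c=lambda t: n * log(1 - t),
    d=lambda x: log(comb(n, x)),
)
print(poisson_direct, poisson_expfam, binom_direct, binom_expfam)
```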


2.6 Joint Probability Distributions 


Often we will consider many random variables simultaneously. The joint probabilistic
behavior of $n$ random variables $X_i$, $1 \le i \le n$, is generally described by their joint
distribution function

$$F_X(x_1, x_2, \ldots, x_n) = P\{X_1 \le x_1, X_2 \le x_2, \ldots, X_n \le x_n\}, \tag{2.38}$$

where the comma in (2.38) indicates intersection and $X$ is shorthand for the $n$ random
variables $X_i$, $1 \le i \le n$. If all the random variables are continuous (strictly speaking,
absolutely continuous) then $F_X(x_1, x_2, \ldots, x_n)$ can be given as a multiple integral of their
joint density $f_X(x_1, x_2, \ldots, x_n)$. For example, if $X_1$ and $X_2$ are two continuous random
variables, then

$$F_X(x_1, x_2) = \int_{-\infty}^{x_2} \int_{-\infty}^{x_1} f_X(t_1, t_2)\,dt_1\,dt_2. \tag{2.39}$$

From (2.39) one can recover $f_X$ from $F_X$ by differentiation; i.e.,

$$f_X(x_1, x_2) = \frac{\partial^2}{\partial x_1 \partial x_2} F_X(x_1, x_2) \tag{2.40}$$

and more generally

$$f_X(x_1, x_2, \ldots, x_n) = \frac{\partial^n}{\partial x_1 \partial x_2 \cdots \partial x_n} F_X(x_1, x_2, \ldots, x_n). \tag{2.41}$$

If the $X$’s are discrete, then $F_X$ is given by summation of the joint discrete density
$f_X(x_1, x_2, \ldots, x_n)$. For example, if $X_1$ and $X_2$ are two discrete random variables, then

$$F_X(x_1, x_2) = \sum_{y_2 \le x_2} \sum_{y_1 \le x_1} f_X(y_1, y_2). \tag{2.42}$$

In the discrete case one can recover $f_X$ from $F_X$, but the process involves technical
limiting arguments, which fortunately we will not need. In general, one usually specifies
the joint behavior of continuous and discrete random variables by specifying their joint
densities; the joint distributions $F_X$ are then obtained by integration or summation.

If one knows the joint distribution or joint density of $n$ random variables, then the joint
distribution or joint density of any subset of these can be obtained by partial integration
or summation. For example, if $X_1$ and $X_2$ are two continuous random variables, then
the density of $X_1$ is given by

$$f_{X_1}(x_1) = \int_{-\infty}^{\infty} f_X(x_1, x_2)\,dx_2 \tag{2.43}$$

and the density of $X_2$ by

$$f_{X_2}(x_2) = \int_{-\infty}^{\infty} f_X(x_1, x_2)\,dx_1. \tag{2.44}$$

If $X_1$ and $X_2$ are discrete random variables, then

$$f_{X_1}(x_1) = \sum_{x_2 \in R(X_2)} f_X(x_1, x_2) \tag{2.45}$$

and

$$f_{X_2}(x_2) = \sum_{x_1 \in R(X_1)} f_X(x_1, x_2). \tag{2.46}$$

The distributions obtained by summation or integration of the joint distributions are
usually referred to as the marginal distributions or densities of the joint distribution and
by definition these are known once the joint distributions are known. In general, one
cannot reverse this process. That is, knowing the marginal distributions of $n$ random
variables, one cannot reconstruct the joint distribution or densities. Put another way,
there may be many joint distributions which have the same marginal distributions.
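
That different joint distributions can share the same marginals is easy to see in a small discrete case. The Python sketch below uses two hypothetical joint probability mass functions on {0, 1} x {0, 1} (our own example, not one from the text) and computes the marginals of each by summing over the other variable.

```python
from fractions import Fraction


def marginals(joint):
    """Return the marginal pmfs of X1 and X2 from a joint pmf.

    `joint` maps pairs (x1, x2) to probabilities; each marginal is
    obtained by summing the joint pmf over the other variable.
    """
    m1, m2 = {}, {}
    for (x1, x2), pr in joint.items():
        m1[x1] = m1.get(x1, 0) + pr
        m2[x2] = m2.get(x2, 0) + pr
    return m1, m2


quarter, half, zero = Fraction(1, 4), Fraction(1, 2), Fraction(0)

# All four pairs equally likely.
joint_a = {(0, 0): quarter, (0, 1): quarter, (1, 0): quarter, (1, 1): quarter}
# X1 and X2 always equal.
joint_b = {(0, 0): half, (0, 1): zero, (1, 0): zero, (1, 1): half}

same_marginals = marginals(joint_a) == marginals(joint_b)
print(same_marginals, joint_a == joint_b)  # True False
```

Both joint distributions give each variable the values 0 and 1 with probability 1/2, so the marginals alone cannot distinguish them.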


Example 2.9 Let $X_1$ and $X_2$ have the joint pdf

$$f(x_1, x_2) = \begin{cases} 10 x_1 x_2^2, & 0 < x_1 < x_2 < 1 \\ 0, & \text{elsewhere.} \end{cases} \tag{2.47}$$

The marginal pdf of $X_1$ is

$$f_{X_1}(x_1) = \int_{x_1}^{1} 10 x_1 x_2^2\,dx_2 = \frac{10}{3} x_1 (1 - x_1^3), \quad 0 < x_1 < 1, \tag{2.48}$$

zero elsewhere, and the marginal pdf of $X_2$ is

$$f_{X_2}(x_2) = \int_{0}^{x_2} 10 x_1 x_2^2\,dx_1 = 5 x_2^4, \quad 0 < x_2 < 1, \tag{2.49}$$

zero elsewhere. Furthermore, the conditional density of $X_2$ given $X_1 = x_1$, $f(x_2|x_1)$, is
given by

$$f(x_2|x_1) = \frac{f(x_1, x_2)}{f_{X_1}(x_1)} = \frac{10 x_1 x_2^2}{\frac{10}{3} x_1 (1 - x_1^3)} = \frac{3 x_2^2}{1 - x_1^3}, \quad 0 < x_1 < x_2 < 1. \tag{2.50}$$
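
The marginal and conditional densities in Example 2.9 can be checked by numerical integration. The Python sketch below assumes the joint density is 10 x1 x2^2 on 0 < x1 < x2 < 1, which is consistent with the marginal 5 x2^4 in (2.49); the evaluation points 0.4 and 0.7 and the midpoint-rule helper `integrate` are our own choices.

```python
def joint_pdf(x1, x2):
    """Joint density, assumed to be 10*x1*x2**2 on 0 < x1 < x2 < 1."""
    return 10.0 * x1 * x2**2 if 0.0 < x1 < x2 < 1.0 else 0.0


def integrate(f, lo, hi, steps=20_000):
    """Midpoint-rule approximation to the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(f(lo + (i + 0.5) * h) for i in range(steps))


x1 = 0.4
# Marginal of X1: integrate the joint density over x2 from x1 to 1.
marg1_numeric = integrate(lambda t: joint_pdf(x1, t), x1, 1.0)
marg1_formula = (10.0 / 3.0) * x1 * (1.0 - x1**3)

x2 = 0.7
# Marginal of X2: integrate the joint density over x1 from 0 to x2.
marg2_numeric = integrate(lambda t: joint_pdf(t, x2), 0.0, x2)
marg2_formula = 5.0 * x2**4

# The conditional density of X2 given X1 = x1 should integrate to 1.
cond_total = integrate(lambda t: 3.0 * t**2 / (1.0 - x1**3), x1, 1.0)
print(marg1_numeric, marg1_formula, marg2_numeric, marg2_formula, cond_total)
```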


In contrast to Example 2.9, there is an important situation where the marginal distributions
determine the joint distribution, and that is when the random variables are
independent.


Definition 2.1 Let $X_i$, $1 \le i \le n$, be random variables. We say that $X_i$, $1 \le i \le n$,
are independent random variables if and only if the joint distribution $F_X(x_1, x_2, \ldots, x_n)$
factors as a product of $n$ distributions $G_{X_i}(x_i)$, $1 \le i \le n$. That is,

$$F_X(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} G_{X_i}(x_i), \tag{2.51}$$

where $G_{X_i}(x_i)$, $1 \le i \le n$, are distribution functions. That is,

$$0 \le G_{X_i}(x_i) \le 1 \tag{2.52}$$

and

$$\lim_{x_i \to -\infty} G_{X_i}(x_i) = 0 \quad \text{and} \quad \lim_{x_i \to \infty} G_{X_i}(x_i) = 1. \tag{2.53}$$


From Definition 2.1 it follows that if the factorization in (2.51) holds, then $G_{X_i}(x_i) =
F_{X_i}(x_i)$, $1 \le i \le n$, are the marginal distributions of $X_i$, $1 \le i \le n$. We consider the case
$n = 2$.

From (2.53) it follows that

$$\lim_{x_2 \to \infty} F_X(x_1, x_2) = F_{X_1}(x_1) \tag{2.54}$$

and

$$\lim_{x_1 \to \infty} F_X(x_1, x_2) = F_{X_2}(x_2). \tag{2.55}$$

By independence,

$$\lim_{x_2 \to \infty} F_X(x_1, x_2) = G_{X_1}(x_1) \tag{2.56}$$

and

$$\lim_{x_1 \to \infty} F_X(x_1, x_2) = G_{X_2}(x_2). \tag{2.57}$$

It then follows from (2.54)-(2.57) that

$$F_{X_i}(x_i) = G_{X_i}(x_i), \quad i = 1, 2. \tag{2.58}$$
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Using this ір (2.51) it follows that Х;,1 < 4 < n are independent if and only if 
т 
Fx (£1, £2, = [| Ех, (zi); (2.59) 
i=l 


that is X;,1 < à < n, are independent if and only if their joint distribution factors 
as a product of their marginal distributions. Hence, if we know that Х;,1 < i € m, 
are independent random variables, then one can reconstruct their joint distribution by 
multiplying together their marginal distributions. 

When all the random variables $X_i$, $1 \le i \le n$, have the same distribution, we say that
they are identically distributed. If, in addition, they are independent, they are usually
referred to as independent and identically distributed (i.i.d.) random variables. For
statistical purposes we shall say they are a random sample of size $n$ of a random variable
$X$. For computing purposes, it is generally more convenient to work with the joint
densities. In this case an equivalent definition of independence (for both continuous and
discrete random variables) is that $X_i$, $1 \le i \le n$, are independent if and only if their joint
density factors as a product of all their marginal densities, i.e.,

$$f_X(x_1, x_2, \ldots, x_n) = \prod_{i=1}^{n} f_{X_i}(x_i). \tag{2.60}$$


Example 2.10 (Joint distributions and independence) Let $X_1$ and $X_2$ have the joint
pdf

$$f(x_1, x_2) = \begin{cases} e^{-(x_1 + x_2)}, & x_1 > 0, \; x_2 > 0, \\ 0, & \text{elsewhere.} \end{cases} \tag{2.61}$$
The marginal density of $X_1$ is the integral of this over $x_2$:
$$f_{X_1}(x_1) = \int_{0}^{\infty} e^{-(x_1 + x_2)} \, dx_2 = e^{-x_1}, \quad \text{for } x_1 > 0. \tag{2.62}$$

Moreover, the marginal density of $X_2$ is $e^{-x_2}$, for $x_2 > 0$, so that

$$f(x_1, x_2) = f_{X_1}(x_1) f_{X_2}(x_2),$$

which implies that $X_1$ and $X_2$ are independent.


2.7 Expectation 


One of the most important operations in probability and statistics is that of finding the 
average value of a random variable, usually called its expectation. The expected value of 
$X$ is denoted by $E(X)$ and the formulas for its calculation are given by

$$E(X) = \begin{cases} \sum_{x \in R(X)} x f_X(x), & \text{if } X \text{ is discrete,} \\ \int_{-\infty}^{\infty} x f_X(x) \, dx, & \text{if } X \text{ is continuous.} \end{cases} \tag{2.63}$$

The relation in (2.63) reflects an important rule: most expectation formulas
for discrete random variables involve summations, while those for continuous random
variables can be obtained by replacing summations with integrations. The general
rules for calculating remain the same in both cases.
If $g(X)$ is a function of $X$, then

$$E[g(X)] = \begin{cases} \sum_{x \in R(X)} g(x) f_X(x), & \text{if } X \text{ is discrete,} \\ \int_{-\infty}^{\infty} g(x) f_X(x) \, dx, & \text{if } X \text{ is continuous.} \end{cases} \tag{2.64}$$

2.7.1 Moments 
If $a$ is a real number, then
$$E[(X - a)^n] \tag{2.65}$$

is called the $n$-th moment of $X$ about $a$. When $a = 0$, these are called the moments of
$X$ and if $a = E(X)$, they are usually called the central moments of $X$. In particular,
when $a = E(X) = \mu_X$ and $n = 2$,

$$E[(X - \mu_X)^2] \tag{2.66}$$

is called the variance of $X$ and is denoted either by $\mathrm{Var}(X)$, $\sigma^2(X)$, $\sigma_X^2$ or just $\sigma^2$. The
square root of $\sigma^2$, $\sigma$, is called the standard deviation of $X$ and is the most commonly used
measure of dispersion of $X$. That is, it measures how much, on average, $X$ differs from
its mean value $E(X)$.

There are a number of important properties of E (X) and Var (X) that we will use 
repeatedly throughout the text and are stated below for the convenience of the reader. 


(1) If $a$ and $b$ are real numbers, then

$$E(aX + b) = aE(X) + b. \tag{2.67}$$

(2) If $X_i$, $i = 1, 2, \ldots, n$, are random variables each having an expectation, then

$$E\left(\sum_{i=1}^{n} X_i\right) = \sum_{i=1}^{n} E(X_i). \tag{2.68}$$

(Properties (1)-(2) are often referred to as the linearity properties of $E(X)$.)

(3) If $X_i$, $i = 1, 2, \ldots, n$, are independent random variables, then

$$E(X_1 X_2 \cdots X_n) = E(X_1) E(X_2) \cdots E(X_n). \tag{2.69}$$


For two random variables $X$ and $Y$, a measure of the relation between $X$ and $Y$ is
the covariance, $\mathrm{Cov}(X, Y)$, defined by

$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)]. \tag{2.70}$$

We note that if $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$. More generally, if
$\mathrm{Cov}(X, Y) = 0$, we say that $X$ and $Y$ are uncorrelated. It is important to observe that
while independent random variables are uncorrelated, random variables can in general
be uncorrelated but not independent. In fact, $Y$ can even be a function of $X$ and
still be uncorrelated with $X$. This points out that the notions of statistical and functional
dependence can be quite different.


Example 2.11 Let $X_1$ and $X_2$ have the joint density given in the following table:

                  x_1 = -2   x_1 = -1   x_1 = 1   x_1 = 2   f_{X_2}(x_2)
x_2 = 1              0         1/4        1/4        0          1/2
x_2 = 4             1/4         0          0        1/4         1/2
f_{X_1}(x_1)        1/4        1/4        1/4       1/4          1

From the marginal densities, $E(X_1) = 0$, $E(X_2) = 5/2$ and $E(X_1 X_2) = 0$, which
shows that the covariance is zero. However, it is easy to verify that $X_1$ and $X_2$ are not
independent. In fact, $X_1^2 = X_2$. That is, the value of $X_1$ completely determines the
value of $X_2$.
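
A small exact computation can confirm that zero covariance does not imply independence. The sketch below assumes that $X_1$ takes the values -2, -1, 1, 2 with probability 1/4 each and that $X_2 = X_1^2$; these support values are an assumption chosen to be consistent with the probabilities in the table, and the helper name `expect` is illustrative.

```python
from fractions import Fraction as F

# Assumed support: X1 uniform on {-2, -1, 1, 2} and X2 = X1**2.
pmf = {(x1, x1 ** 2): F(1, 4) for x1 in (-2, -1, 1, 2)}

def expect(g):
    """Expectation of g(X1, X2) under the joint pmf."""
    return sum(p * g(x1, x2) for (x1, x2), p in pmf.items())

e1 = expect(lambda x1, x2: x1)
e2 = expect(lambda x1, x2: x2)
cov = expect(lambda x1, x2: x1 * x2) - e1 * e2

# Independence would require P{X1 = 2, X2 = 4} = P{X1 = 2} P{X2 = 4}.
p_joint = pmf[(2, 4)]
p_x1 = sum(p for (x1, _), p in pmf.items() if x1 == 2)
p_x2 = sum(p for (_, x2), p in pmf.items() if x2 == 4)

print(cov, p_joint, p_x1 * p_x2)
```

The covariance is exactly zero, while the joint probability differs from the product of the marginals, so the variables are uncorrelated but dependent.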


A normalized version of $\mathrm{Cov}(X, Y)$ is the correlation coefficient,

$$\rho(X, Y) = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \sigma_Y}. \tag{2.71}$$

An important property of $\rho$ is that
$$-1 \le \rho \le 1 \tag{2.72}$$

with $\rho = \pm 1$ if and only if $X$ and $Y$ are linearly related. Hence, values of $\rho$ close to $\pm 1$
suggest an approximate linear relationship between $X$ and $Y$, while values of $\rho$ close to
zero suggest little or no linear relation between $X$ and $Y$. This property, as we shall see,
is fundamental to interpreting many of the results of regression analysis. In particular,
if $X$ and $Y$ are jointly normally distributed, then $\rho = 0$ if and only if $X$ and $Y$ are
independent.


Properties of Var(X) 


From its definition $\sigma^2(X) \ge 0$, and it can be shown that $\sigma^2(X) = 0$ if and only if
$P\{X = \mu_X\} = 1$. Furthermore,

(1) $\mathrm{Var}(aX) = a^2 \mathrm{Var}(X)$;
(2) $\mathrm{Var}(X + b) = \mathrm{Var}(X)$;
(3) $\mathrm{Var}\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i^2 \mathrm{Var}(X_i) + \sum_{i \ne j} a_i a_j \mathrm{Cov}(X_i, X_j)$.

If $X_i$, $i = 1, 2, \ldots, n$, are pairwise uncorrelated, then (3) simplifies to

(4)
$$\mathrm{Var}\left(\sum_{i=1}^{n} a_i X_i\right) = \sum_{i=1}^{n} a_i^2 \mathrm{Var}(X_i). \tag{2.73}$$

In particular, if $X_i$, $1 \le i \le n$, are $n$ independent random variables, then (2.73)
holds.

Finally, we note some further properties of the covariance.

(1) $\mathrm{Cov}(X, X) = \mathrm{Var}(X)$.

(2) If $X_i$, $1 \le i \le n$, and $Y_j$, $1 \le j \le m$, are random variables, then

$$\mathrm{Cov}\left(\sum_{i=1}^{n} a_i X_i, \sum_{j=1}^{m} b_j Y_j\right) = \sum_{i=1}^{n} \sum_{j=1}^{m} a_i b_j \mathrm{Cov}(X_i, Y_j). \tag{2.74}$$


2.7.2 Moment Generating Function 


Although moments can often be calculated by summation or integration, these calculations
are frequently done more conveniently by a number of indirect approaches. An
important technique in this regard is to use the method of generating functions. Here 
we discuss one class of generating functions, moment generating functions (mgf) which 
are useful not only for the computation of moments, but for determining properties of 
distributions as well. They will play a key role in our discussion in Chapters 4 and 5 of 
statistical properties of least squares estimators of regression models. 


Definition 2.2 Let $X$ be a random variable. The moment generating function of
$X$, $M_X(t)$, is defined by
$$M_X(t) = E(e^{tX}) \tag{2.75}$$

provided that the sum or integral required in (2.75) exists.

As noted in the definition of $M_X(t)$, not all random variables have a well defined
moment generating function, and if $M_X(t)$ exists, it will only do so for an interval of
values of $t$ about $t = 0$. Since we will be mostly concerned with normal random variables,
this technical problem will not arise in this text.

If $X$ is a discrete random variable, then

$$M_X(t) = \sum_{x \in R(X)} e^{tx} f_X(x), \tag{2.76}$$
while if $X$ has a density, then
$$M_X(t) = \int_{-\infty}^{\infty} e^{tx} f_X(x) \, dx. \tag{2.77}$$


By formally differentiating (2.76)-(2.77) we find that

$$\left. \frac{d^n M_X(t)}{dt^n} \right|_{t=0} = E(X^n), \tag{2.78}$$

so that the $n$-th moment of $X$ can be obtained by differentiating $M_X(t)$ $n$ times and
evaluating at $t = 0$. In this sense, $M_X(t)$ generates the moments of $X$. Several examples
of the usefulness of this idea follow.


Example 2.13 Let $X$ be a binomial random variable, then

$$M_X(t) = (pe^t + q)^n. \tag{2.79}$$

From (2.19) and (2.75)

$$M_X(t) = \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x q^{n-x} = \sum_{x=0}^{n} \binom{n}{x} (pe^t)^x q^{n-x}. \tag{2.80}$$

From the binomial theorem it follows that
$$\sum_{x=0}^{n} \binom{n}{x} (pe^t)^x q^{n-x} = (pe^t + q)^n, \tag{2.81}$$

giving (2.79).
Differentiating (2.79) we can obtain formulas for $E(X)$ and $\mathrm{Var}(X)$. In fact,

$$E(X) = \left. \frac{d}{dt} M_X(t) \right|_{t=0} = \left. npe^t (pe^t + q)^{n-1} \right|_{t=0} = np \tag{2.82}$$
since $p + q = 1$.
Also
$$E(X^2) = \left. \frac{d^2}{dt^2} M_X(t) \right|_{t=0} = \left. \left[ npe^t (pe^t + q)^{n-1} + n(n-1) p^2 e^{2t} (pe^t + q)^{n-2} \right] \right|_{t=0} = np + n(n-1)p^2. \tag{2.83}$$
From (2.66)
$$\mathrm{Var}(X) = E(X^2) - [E(X)]^2 \tag{2.84}$$
so that
$$\mathrm{Var}(X) = np + n(n-1)p^2 - n^2 p^2 = np - np^2 = npq. \tag{2.85}$$

We leave it as an exercise to compute $E(X)$ and $\mathrm{Var}(X)$ by direct summation.
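
The formulas $E(X) = np$ and $\mathrm{Var}(X) = npq$ can also be checked numerically by direct summation over the binomial pmf. The following is a minimal sketch; the function name `binomial_moments` and the values $n = 10$, $p = 0.3$ are illustrative assumptions, not part of the text.

```python
from math import comb

def binomial_moments(n, p):
    """Mean and variance of a binomial(n, p) variable by direct summation."""
    q = 1.0 - p
    pmf = [comb(n, x) * p**x * q**(n - x) for x in range(n + 1)]
    mean = sum(x * f for x, f in enumerate(pmf))
    second = sum(x * x * f for x, f in enumerate(pmf))
    return mean, second - mean**2

mean, var = binomial_moments(10, 0.3)
print(mean, var)  # compare with n*p = 3.0 and n*p*q = 2.1
```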


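Example 2.14 A random variable $X$ is said to have a standard normal distribution
if its density is given by

$$f_X(x) = (2\pi)^{-1/2} e^{-x^2/2}, \quad -\infty < x < \infty. \tag{2.86}$$
The moment generating function of $X$ is given by
$$M_X(t) = e^{t^2/2}. \tag{2.87}$$

To prove (2.87) we will make use of the well known integral

$$\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-x^2/2} \, dx = 1, \tag{2.88}$$
which establishes that $f_X(x)$ is indeed a probability density.

Then,

$$M_X(t) = \int_{-\infty}^{\infty} e^{tx} f_X(x) \, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{tx - x^2/2} \, dx. \tag{2.89}$$
Now,
$$tx - \frac{x^2}{2} = -\frac{1}{2}(x^2 - 2tx) = -\frac{1}{2}(x^2 - 2tx + t^2 - t^2) = -\frac{1}{2}(x - t)^2 + \frac{1}{2} t^2. \tag{2.90}$$

Using (2.90) in (2.89) gives

$$M_X(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-(x-t)^2/2} e^{t^2/2} \, dx = e^{t^2/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-(x-t)^2/2} \, dx. \tag{2.91}$$

Making the substitution $y = x - t$ and using (2.88) gives (2.87).
Using (2.87) one can easily compute $E(X)$ and $\mathrm{Var}(X)$. From (2.78) and (2.87)

$$E(X) = \left. \frac{d}{dt} e^{t^2/2} \right|_{t=0} = \left. t e^{t^2/2} \right|_{t=0} = 0 \tag{2.92}$$

so that

$$\mathrm{Var}(X) = E(X^2) = \left. \frac{d^2}{dt^2} e^{t^2/2} \right|_{t=0} = \left. \left[ e^{t^2/2} + t^2 e^{t^2/2} \right] \right|_{t=0} = 1. \tag{2.93}$$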


In addition to facilitating the computation of moments, moment generating functions 
are useful in studying the distribution of the random variables themselves. This follows 
from the fact that the moment generating function determines the distribution of X 
itself. 

Specifically, if $X$ and $Y$ are two random variables, and $M_X(t) = M_Y(t)$ for $t \in (-a, a)$,
then $X$ and $Y$ are identically distributed. This relation is particularly important for
determining the distribution of sums of independent random variables. To see this,
suppose $X_1$ and $X_2$ are independent random variables, then

$$M_{X_1 + X_2}(t) = E\left[e^{t(X_1 + X_2)}\right] = E\left[e^{tX_1} e^{tX_2}\right]. \tag{2.94}$$
By (2.69) and (2.75)
$$E\left[e^{tX_1} e^{tX_2}\right] = E\left(e^{tX_1}\right) E\left(e^{tX_2}\right) = M_{X_1}(t) M_{X_2}(t). \tag{2.95}$$

As an example, suppose that $X_i$, $i = 1, 2$, are independent Poisson random variables with
parameters $\lambda_i$, $i = 1, 2$. Then $X_1 + X_2$ is a Poisson random variable with parameter
$\lambda_1 + \lambda_2$.

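We begin by finding the mgf of $X_i$, $i = 1, 2$. From (2.76) and (2.23)

$$M_{X_i}(t) = \sum_{x=0}^{\infty} e^{tx} \frac{e^{-\lambda_i} \lambda_i^x}{x!} = e^{-\lambda_i} \sum_{x=0}^{\infty} \frac{(e^t \lambda_i)^x}{x!} = e^{-\lambda_i} e^{\lambda_i e^t}, \quad i = 1, 2. \tag{2.96}$$
From (2.94) and (2.95)

$$M_{X_1 + X_2}(t) = \left(e^{-\lambda_1} e^{\lambda_1 e^t}\right) \left(e^{-\lambda_2} e^{\lambda_2 e^t}\right) = e^{-(\lambda_1 + \lambda_2)} e^{(\lambda_1 + \lambda_2) e^t}. \tag{2.97}$$

Letting $\lambda_3 = \lambda_1 + \lambda_2$, (2.97) is the mgf of a Poisson random variable $Y$ with parameter
$\lambda_3 = \lambda_1 + \lambda_2$. Since $Y$ and $X_1 + X_2$ have the same mgf, they have the same distribution.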


Property (2.95) extends easily to the situation where $X_i$, $1 \le i \le n$, are $n$ independent
random variables, and $S_n = \sum_{i=1}^{n} X_i$. Then

$$M_{S_n}(t) = \prod_{i=1}^{n} M_{X_i}(t). \tag{2.98}$$


In Section 2.8 we will use (2.98) to establish the important fact that the sum of n 
independent normal random variables is a normal random variable. 
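
The Poisson result obtained above through moment generating functions can also be checked directly: the pmf of $X_1 + X_2$ is the convolution of the two Poisson pmfs, and it should coincide with the Poisson pmf with parameter $\lambda_1 + \lambda_2$. The sketch below uses illustrative parameter values (1.5 and 2.5) and helper names that are assumptions, not taken from the text.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P{X = k} for a Poisson random variable with parameter lam."""
    return exp(-lam) * lam**k / factorial(k)

def pmf_of_sum(k, lam1, lam2):
    """P{X1 + X2 = k} for independent Poisson variables, by convolution."""
    return sum(poisson_pmf(j, lam1) * poisson_pmf(k - j, lam2)
               for j in range(k + 1))

lam1, lam2 = 1.5, 2.5  # illustrative values
max_gap = max(abs(pmf_of_sum(k, lam1, lam2) - poisson_pmf(k, lam1 + lam2))
              for k in range(20))
print(max_gap)  # difference is at the level of floating-point rounding
```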


2.8 The Normal and Related Random Variables


In statistical theory and in regression analysis in particular, normal random variables
play an important role. In this section we will develop some properties of these random
variables and several others, the $\chi^2$ (chi-square), $T$ and $F$ random variables, which are
derived from them.


2.8.1 Normal Random Variables


A random variable $X$ is said to be normally distributed with parameters $(\mu, \sigma^2)$, written
as $X \sim N(\mu, \sigma^2)$, if $X$ has a density of the form

$$f_X(x) = \frac{1}{\sigma \sqrt{2\pi}} \exp\left[-\frac{(x - \mu)^2}{2\sigma^2}\right], \quad -\infty < x < \infty. \tag{2.99}$$


It can be shown that normal random variables have many important properties. Among 
these are: 


(1) If $a \ne 0$, and $X \sim N(\mu, \sigma^2)$, then

$$aX + b \sim N(a\mu + b, a^2 \sigma^2). \tag{2.100}$$

(2) If $X_i$, $1 \le i \le n$, are independent random variables and $X_i \sim N(\mu_i, \sigma_i^2)$, $1 \le i \le n$,
then
$$\sum_{i=1}^{n} X_i \sim N\left(\sum_{i=1}^{n} \mu_i, \sum_{i=1}^{n} \sigma_i^2\right). \tag{2.101}$$

(3) (Central Limit Theorem) Although (2.101) is an exact relation when the $X_i \sim N(\mu_i, \sigma_i^2)$
and independent, under very general conditions

$$\sum_{i=1}^{n} c_i X_i \text{ is approximately } N\left(\sum_{i=1}^{n} c_i \mu_i, \sum_{i=1}^{n} c_i^2 \sigma_i^2\right), \tag{2.102}$$
provided that $n$ is sufficiently large. (The $X$'s do not have to be independent for
this to hold.) This is one of the most remarkable theorems in probability theory,
usually referred to as the Central Limit Theorem (CLT). Full details of proofs and
conditions under which the theorem holds can be found in Refs. [63, 89].


(4) From (2.99) we find that
$$Z = (X - \mu)/\sigma \tag{2.103}$$

is $N(0, 1)$. $Z$ is usually referred to as a standard normal random variable. Conversely,
if $Z \sim N(0, 1)$, then $\sigma Z + \mu \sim N(\mu, \sigma^2)$.

From Table A.1 one can easily see that if $Z \sim N(0, 1)$, then
$$P\{-3.0 \le Z \le 3.0\} \approx 1 \tag{2.104}$$
and from (2.99) that for a $N(\mu, \sigma^2)$ random variable
$$P\{\mu - 3\sigma \le X \le \mu + 3\sigma\} \approx 1. \tag{2.105}$$

Thus, it is highly unlikely that an observation of a normal random variable is more than
three standard deviations from its mean. In fact, it is quite unlikely that the values of a
normal random variable are more than two standard deviations from $\mu$ since

$$P\{\mu - 2\sigma \le X \le \mu + 2\sigma\} \approx 0.95. \tag{2.106}$$
This fact is the basis for many "rules of thumb" in statistical analysis.


(5) The normal distribution is a member of the exponential family. 
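
The two- and three-standard-deviation probabilities quoted in property (4) can be reproduced without tables. The sketch below relies on the standard identity $\Phi(z) = \frac{1}{2}[1 + \mathrm{erf}(z/\sqrt{2})]$ for the standard normal distribution function; the function names are illustrative assumptions.

```python
from math import erf, sqrt

def std_normal_cdf(z):
    """Distribution function of N(0, 1), expressed through the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def prob_within(k):
    """P{-k <= Z <= k} for a standard normal Z."""
    return std_normal_cdf(k) - std_normal_cdf(-k)

p2 = prob_within(2.0)
p3 = prob_within(3.0)
print(round(p2, 4), round(p3, 4))
```

By (2.103), the same probabilities apply to any $N(\mu, \sigma^2)$ random variable measured in units of $\sigma$ about $\mu$.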


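2.8.2 Chi-Square Random Variables

If $X \sim N(0, 1)$, then the random variable $Y = X^2$ has the density

$$f_Y(y) = \begin{cases} (2\pi y)^{-1/2} e^{-y/2}, & y > 0, \\ 0, & y \le 0. \end{cases} \tag{2.107}$$

The random variable $Y$ is said to have a $\chi^2$-distribution with one degree of freedom (df).
If $Y_i$, $1 \le i \le n$, are independent $\chi^2$ random variables with one df, then their sum

$$\chi^2(n) = \sum_{i=1}^{n} Y_i \tag{2.108}$$
has the density
$$f_{\chi^2(n)}(x) = \begin{cases} \dfrac{1}{2^{n/2} \Gamma(n/2)} e^{-x/2} x^{(n/2) - 1}, & x > 0, \\ 0, & x \le 0, \end{cases} \tag{2.109}$$
where $\Gamma(\alpha)$ is the well known gamma function defined by
$$\Gamma(\alpha) = \int_{0}^{\infty} y^{\alpha - 1} e^{-y} \, dy, \quad \alpha > 0, \tag{2.110}$$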


and $n$ in (2.109) is a positive integer again called the degrees of freedom (df) of the random
variable $\chi^2(n)$. A number of useful properties of $\chi^2$ random variables are summarized
below.

(1) If $\chi^2(n_i)$, $1 \le i \le m$, are independent $\chi^2$ random variables with $n_i$ df, then their
sum

$$\chi^2(n) = \sum_{i=1}^{m} \chi^2(n_i), \quad n = \sum_{i=1}^{m} n_i \tag{2.111}$$

is a $\chi^2$ random variable with $n = \sum_{i=1}^{m} n_i$ degrees of freedom.

(2) If $X$ is $\chi^2(n)$ and $Y$ is $\chi^2(m)$ and $X = Y + Z$ with $Y$ and $Z$ independent random
variables, then $Z = X - Y$ is $\chi^2(n - m)$ if $n > m$.

(3) $E[\chi^2(n)] = n$ and $\mathrm{Var}[\chi^2(n)] = 2n$.

(4) If $n$ is large, then
$$[\chi^2(n) - n] / \sqrt{2n} \tag{2.112}$$

is approximately $N(0, 1)$. This follows as a consequence of the Central Limit Theorem
(2.102) and the representation given in (2.108).

(5) If $X_i$, $1 \le i \le n$, are independent and $N(\mu, \sigma^2)$, then (with $\bar{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$)
$$\frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \tag{2.113}$$

is a $\chi^2$ random variable with $n - 1$ degrees of freedom. This is an important result
in classical statistics because it shows that the sample variance

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \tag{2.114}$$

of $n$ independent $N(\mu, \sigma^2)$ random variables has the distribution of a $\sigma^2 \chi^2(n - 1)/(n - 1)$
random variable. In particular, it follows from (2.114) that $E(s^2) = \sigma^2$.
Hence, in statistical terms (see Section 2.10) $s^2$ is an unbiased estimator of the
population variance. We will show in Chapters 3 and 5 that this result has an
important generalization in estimating the error in regression analysis.
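
The unbiasedness of $s^2$ in property (5) can be illustrated by simulation. The sketch below is only a Monte Carlo check under assumed values ($\mu = 1$, $\sigma = 2$, $n = 5$, 20000 replications, fixed seed); it compares the average of $s^2$ (divisor $n - 1$) with the average of the estimator that divides by $n$.

```python
import random

def variance(xs, ddof):
    """Sum of squared deviations from the sample mean, divided by len(xs) - ddof."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - ddof)

random.seed(0)                        # fixed seed so the run is reproducible
mu, sigma, n, reps = 1.0, 2.0, 5, 20000
s2_values, biased_values = [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    s2_values.append(variance(sample, ddof=1))      # divisor n - 1
    biased_values.append(variance(sample, ddof=0))  # divisor n

avg_s2 = sum(s2_values) / reps
avg_biased = sum(biased_values) / reps
print(avg_s2, avg_biased)  # close to sigma**2 = 4 and (n - 1)/n * sigma**2 = 3.2
```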


2.8.3 t and F-Distributions

Two other important classes of random variables closely related to the normal are the $T$
and $F$ random variables.

t-Distribution

A $T$ random variable has the density

$$t(x; n) = \frac{\Gamma[(n + 1)/2]}{\Gamma(n/2) \sqrt{n\pi}} \, \frac{1}{(1 + x^2/n)^{(n+1)/2}}, \quad -\infty < x < \infty, \tag{2.115}$$

where $n$ is a positive integer called the degrees of freedom. In classical statistical theory
$T$ random variables occur naturally in the estimation theory associated with the mean
$\mu$ of a normal random variable. As we shall see, they play a similar role in regression
analysis.

An examination of Table A.2 suggests that as $n \to \infty$ the $t$ density is approximately
$N(0, 1)$. In fact, with some tedious algebra it can be shown that

$$\lim_{n \to \infty} t(x; n) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2} x^2\right), \quad -\infty < x < \infty. \tag{2.116}$$


We will find this result useful in regression analysis because it can be used to give an
"eyeball test" for the significance of regression coefficients when the number of observations
is large.
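
The limit (2.116) can be examined numerically by evaluating the density (2.115) for increasing degrees of freedom. The sketch below computes the largest absolute gap between the $t$ density and the standard normal density on a grid over $[-4, 4]$; the grid and the chosen values of $n$ are illustrative assumptions. The log-gamma function is used to avoid overflow for large $n$.

```python
from math import exp, lgamma, log, pi, sqrt

def t_density(x, n):
    """Density (2.115) of a T random variable with n degrees of freedom."""
    log_c = lgamma((n + 1) / 2) - lgamma(n / 2) - 0.5 * log(n * pi)
    return exp(log_c - (n + 1) / 2 * log(1 + x * x / n))

def normal_density(x):
    """Standard normal density."""
    return exp(-0.5 * x * x) / sqrt(2 * pi)

grid = [k / 10 for k in range(-40, 41)]
gaps = {n: max(abs(t_density(x, n) - normal_density(x)) for x in grid)
        for n in (5, 30, 1000)}
print(gaps)  # the gap shrinks as n grows
```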

$T$ random variables have their origin in classical statistics. It is well known that if
$X_i$, $1 \le i \le n$, is a random sample of size $n$ of a $N(\mu, \sigma^2)$ random variable, then

$$T = \frac{\bar{X}_n - \mu}{s / \sqrt{n}} \tag{2.117}$$

has a $t$ density with $n - 1$ degrees of freedom. More generally, if $Z \sim N(0, 1)$ and $\chi^2(m)$
is a chi-square random variable with $m$ df and independent of $Z$, then

$$T = \frac{Z}{\sqrt{\chi^2(m)/m}} \tag{2.118}$$

has a (Student's) $t$-distribution with $m$ degrees of freedom. Again, this result, generalizing
(2.117), will play a significant role in regression analysis.


F-Distribution


If $\chi^2(n)$ and $\chi^2(m)$ are independent chi-square random variables, the random variable

$$F = \frac{\chi^2(n)/n}{\chi^2(m)/m} \tag{2.119}$$
has the density
$$f(x; n, m) = \begin{cases} \dfrac{(n/m)^{n/2} x^{n/2 - 1}}{B(n/2, m/2) \left[1 + (n/m) x\right]^{(n+m)/2}}, & x > 0, \\ 0, & x \le 0, \end{cases} \tag{2.120}$$
where
$$B(\alpha, \beta) = \int_{0}^{1} x^{\alpha - 1} (1 - x)^{\beta - 1} \, dx \tag{2.121}$$

is the beta function. Such random variables are called "F" random variables. (Note: A
chi-square random variable divided by its degrees of freedom (df) is often referred to as
a mean square. Hence, an $F$ random variable is the ratio of two mean squares.) Again
the integers $(n, m)$ are called the numerator and denominator df respectively. In the
classical normal statistical theory $F$ random variables occur when one needs to test for
the equality of variances. A similar role occurs in regression analysis.

When $m$ is large, the denominator $\chi^2(m)/m$ in (2.119) is close to one, so that $F$ is
approximately distributed as $\chi^2(n)/n$. This can be helpful in finding
approximate values for critical points of the $F$ density when the denominator df $m$ is
large. This situation is of frequent occurrence in regression analysis because $m$ is a sample
size which generally is considerably larger than $n$.

2.8.4 Lognormal Random Variables


Historically in statistics one often chooses normal random variables to describe data 
that have a reasonably symmetric density. This choice is frequently justified (without 
analysis) on the basis of the Central Limit Theorem. However, increasingly one is faced 
with long-tailed asymmetric data such as stock prices, income levels and housing prices. 
Frequently, it is argued that by taking the logarithm of the data the transformed data
will be approximately normally distributed. This argument suggests that one consider a
random variable $Y$ such that

$$X = \log Y \quad \text{or} \quad Y = e^X \tag{2.122}$$
where $X$ has a $N(\mu, \sigma^2)$ distribution. From (2.122) it is easily shown that $Y$ has a
density given by

$$f_Y(y) = \begin{cases} \dfrac{1}{\sqrt{2\pi} \, \sigma y} \exp\left[-\dfrac{(\log y - \mu)^2}{2\sigma^2}\right], & y > 0, \\ 0, & y \le 0. \end{cases} \tag{2.123}$$

In this case $Y$ is called a lognormal random variable. Although the density is similar in
functional form to the normal density, the reader should be cautioned that $\mu \ne E(Y)$
and $\sigma^2 \ne \mathrm{Var}(Y)$. In fact, it can be shown that

$$E(Y) = \exp(\mu + \sigma^2/2) \tag{2.124}$$

and

$$\mathrm{Var}(Y) = \exp(2\mu + \sigma^2) \left[\exp(\sigma^2) - 1\right]. \tag{2.125}$$
Note from (2.124) that $E(Y) > e^{\mu}$, which is the median of $Y$; a result which reflects the
asymmetry of the distribution of $Y$.

In regression analysis logarithmic transformations of the data are frequently taken 
(Sec. 3.10) in the hope that the resulting data is normally distributed. This leads to 
random variables Y that are lognormally distributed. This type of assumption is often 
used when trying to model data that changes by some multiplicative process over time, 
such as population, prices, income etc. 
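
The relations (2.124) and $\text{median}(Y) = e^{\mu}$ can be illustrated by simulating $Y = e^X$ with $X \sim N(\mu, \sigma^2)$. The sketch below is a Monte Carlo check only; the parameter values ($\mu = 0.5$, $\sigma = 0.8$), the sample size and the seed are assumptions made for illustration.

```python
import random
from math import exp

random.seed(1)                 # fixed seed so the run is reproducible
mu, sigma, n = 0.5, 0.8, 100000
ys = sorted(exp(random.gauss(mu, sigma)) for _ in range(n))

sample_mean = sum(ys) / n
sample_median = 0.5 * (ys[n // 2 - 1] + ys[n // 2])  # n is even

theory_mean = exp(mu + sigma ** 2 / 2)   # equation (2.124)
theory_median = exp(mu)

print(sample_mean, theory_mean)
print(sample_median, theory_median)
```

The simulated mean exceeds the simulated median, in line with the right skew of the lognormal distribution.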


2.9 Statistical Estimation 


As we shall see, regression analysis is largely concerned with the problem of parameter 
estimation and the resulting hypothesis testing associated with these estimates. In the 
current statistical literature, there are a large number of methods for doing this. In this 
section we will briefly review some of the more common techniques of parameter estimation
with particular emphasis on maximum likelihood estimation (MLE), a procedure we
will find most useful in this text.


2.9.1 The Method of Moments 


Suppose that $X$ is a random variable whose density (discrete or continuous) $f_X(x)$ depends
on $m$ unknown parameters $(\theta_1, \theta_2, \ldots, \theta_m)$. We will denote this dependence by
writing the density in the form

$$f_X(x; \theta_1, \theta_2, \ldots, \theta_m), \tag{2.126}$$
where the parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_m)$ lies in some subset $\Theta \subset R^m$ called the
parameter space. The method of moments uses the sample moments to estimate $\theta$ in the
following way. Let $X_i$, $1 \le i \le n$, be a random sample of size $n$ of $X$. The $k$-th sample
moment of $X_i$, $1 \le i \le n$, is defined by

$$\bar{X}^{(k)} = \frac{1}{n} \sum_{i=1}^{n} (X_i)^k. \tag{2.127}$$

If $x_i$, $1 \le i \le n$, are the actual observed values of $X_i$, $1 \le i \le n$, then
$$\bar{x}^{(k)} = \frac{1}{n} \sum_{i=1}^{n} (x_i)^k \tag{2.128}$$

is the observed value of $\bar{X}^{(k)}$. To estimate $(\theta_1, \theta_2, \ldots, \theta_m)$ we set up the following
equations:
$$\bar{x}^{(j)} = E(X^j), \quad 1 \le j \le m, \tag{2.129}$$

and then solve (2.129) for the $m$ values of the parameters $\theta_i$, $1 \le i \le m$. The resulting
moment estimators are denoted by $\hat{\theta}_i$, $1 \le i \le m$. As is customary in statistics, we will
generally not distinguish between the estimators, which are random variables, and their
observed values, the estimates, which are real numbers.

Most often, one uses the first $m$ sample moments $\bar{X}^{(k)}$, $1 \le k \le m$. We illustrate
these ideas with a number of examples.


Example 2.15 Suppose that $X$ is a Bernoulli random variable whose density is
given by $f_X(0; \theta) = P\{X = 0\} = 1 - \theta$, $f_X(1; \theta) = P\{X = 1\} = \theta$, and $f_X(x; \theta) = 0$,
otherwise. As is easily shown, $E(X) = \theta$ and $\bar{X}^{(1)} = \sum_{i=1}^{n} X_i / n$. Hence, using (2.129) a
moment estimator for $\theta$ is given by solving

$$\frac{1}{n} \sum_{i=1}^{n} x_i = E(X) = \theta \tag{2.130}$$

where the $x_i$'s are the observed values of $X_i$, $1 \le i \le n$. The resulting estimator $\hat{\theta}$ for $\theta$ is
given by
$$\hat{\theta} = \bar{X}_n, \tag{2.131}$$

which is the sample mean.


Example 2.16 Let $X$ be a $N(\mu, \sigma^2)$ random variable (here $\theta_1 = \mu$, $\theta_2 = \sigma^2$) whose
density is given by (2.99). To find moment estimators for $(\mu, \sigma^2)$ we use the fact that
$E(X) = \mu$ and $E(X^2) = \sigma^2 + \mu^2$. Hence, using (2.127) with the first two moments

$$\bar{x}^{(1)} = \frac{1}{n} \sum_{i=1}^{n} x_i = \mu \tag{2.132}$$

and
$$\bar{x}^{(2)} = \frac{1}{n} \sum_{i=1}^{n} x_i^2 = \sigma^2 + \mu^2. \tag{2.133}$$
Solving (2.132) and (2.133) gives the estimators
$$\hat{\mu} = \bar{X}_n, \quad \hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \bar{X}_n^2. \tag{2.134}$$

Some further algebra shows that

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2. \tag{2.135}$$

As we shall see, one generally uses $n - 1$ rather than $n$ in (2.135) to define the sample
variance

$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (X_i - \bar{X}_n)^2 \tag{2.136}$$

which is an unbiased estimator for $\sigma^2$.


Example 2.17 In many cases the equations (2.129) cannot be solved explicitly and 
numerical techniques must be used to obtain explicit values of the estimates. We illustrate 
this with some data taken from [16]. The problem concerns the distribution of group 
sizes in various social settings. If one assumes that groups such as pedestrians, groups 
at parties, etc., form at random, then it is reasonable to assume that the size of such 
groups is described by a Poisson random variable. However, since a group must contain
at least one member, $P\{X = 0\} = 0$, which is impossible if $X$ is Poisson. Thus we
consider the conditional distribution of $X$ conditioned on $X \ge 1$. This distribution is
given by (see Exercise 2.14)

$$f_X(x; \theta) = \frac{\theta^x e^{-\theta}}{x! \, (1 - e^{-\theta})}, \quad x = 1, 2, \ldots \tag{2.137}$$

where $E(X) = \theta / (1 - e^{-\theta})$, $\theta > 0$. Using (2.128) for $k = 1$ we obtain an estimate $\hat{\theta}$ for
$\theta$ by solving the equation

$$\theta / (1 - e^{-\theta}) = \bar{x}_n, \tag{2.138}$$

where $\bar{x}_n$ denotes the average size of $n$ observed groups.

Using elementary calculus it is easily shown that (2.138) has a unique solution for
$\bar{x}_n > 1$ and this condition will always be satisfied since $X \ge 1$. For convenience, let
$\bar{x}_n = \mu$ so that (2.138) can be written as

$$\theta = \mu (1 - e^{-\theta}). \tag{2.139}$$

To solve (2.139) we use a common numerical method, iteration. For this we make an
initial guess $\hat{\theta}_0$ and successively define the sequence $\hat{\theta}_n$ by

$$\hat{\theta}_{n+1} = \mu (1 - e^{-\hat{\theta}_n}), \quad n \ge 0. \tag{2.140}$$
In Table 2.1 we show the convergence behavior of this method. As one can see, the
method converges quite rapidly using the starting values of $\hat{\theta}_0 = \mu$.


Table 2.1 Convergence of $\hat{\theta}_{n+1} = \mu (1 - e^{-\hat{\theta}_n})$

   n        mu = 5.0    mu = 7.0    mu = 9.0
   0         5.0000      7.0000      9.0000
   1         4.9663      6.9936      8.9989
   2         4.9651      6.9936      8.9989
   3         4.9652      6.9936        -

Note: "-" indicates convergence.


Using the estimate $\hat{\theta}$ from (2.140) we can then estimate the density of $X$ by
$$f_X(x; \hat{\theta}) = \frac{e^{-\hat{\theta}} \hat{\theta}^x}{x! \, (1 - e^{-\hat{\theta}})}. \tag{2.141}$$

In Table 2.2 we show data on shopping group sizes and the predicted values given by
(2.141) using $\hat{\theta}_0 = 1.51$ and $\hat{\theta} = 0.889$. As one can see, the fit is quite good. This can be
supported further by using the chi-square test [89, 63].


Table 2.2 Estimated Frequencies of Shopping Group Sizes

   Group      Observed frequency      Estimated frequency
   size x     of group of size x      of group of size x
     1              316                     316
     2              141                     141
     3               44                      42
     4                5                       9
     5                4                       2
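
The iteration (2.140) and the fitted frequencies in Table 2.2 can be reproduced with a few lines of code. The sketch below uses the observed frequencies from Table 2.2; the function names, the tolerance and the iteration cap are illustrative assumptions rather than part of the original example.

```python
from math import exp, factorial

def solve_truncated_poisson(mean_size, tol=1e-10, max_iter=1000):
    """Fixed-point iteration theta <- mean_size * (1 - exp(-theta)), as in (2.140)."""
    theta = mean_size                       # starting value theta_0 = mu
    for _ in range(max_iter):
        new_theta = mean_size * (1.0 - exp(-theta))
        if abs(new_theta - theta) < tol:
            return new_theta
        theta = new_theta
    raise RuntimeError("iteration did not converge")

# Observed frequencies of shopping group sizes (Table 2.2).
observed = {1: 316, 2: 141, 3: 44, 4: 5, 5: 4}
n_groups = sum(observed.values())
mean_size = sum(x * c for x, c in observed.items()) / n_groups
theta_hat = solve_truncated_poisson(mean_size)

def fitted_frequency(x):
    """Expected number of groups of size x under the fitted density (2.141)."""
    p = exp(-theta_hat) * theta_hat ** x / (factorial(x) * (1.0 - exp(-theta_hat)))
    return n_groups * p

fitted = [round(fitted_frequency(x)) for x in sorted(observed)]
print(round(mean_size, 2), round(theta_hat, 3), fitted)
```

The iteration converges because the derivative of the map, $\mu e^{-\theta}$, is below one in absolute value near the solution for these data.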


2.9.2 Maximum Likelihood Estimation 


In maximum likelihood estimation we assume that a sample $X_i$, $1 \le i \le n$, of a random
variable $X$ has a joint density (discrete or continuous) given by

$$L = f(x_1, x_2, \ldots, x_n; \theta_1, \theta_2, \ldots, \theta_m) \tag{2.142}$$
where again $(\theta_1, \theta_2, \ldots, \theta_m)$ are the unknown parameters whose estimates are given by
choosing $\theta_i$, $1 \le i \le m$, to maximize $L$ for each observed sample sequence $(x_1, x_2, \ldots, x_n)$.
In effect, we estimate the parameters in such a way that makes the observed sample the
most likely to occur.

When $X_i$, $1 \le i \le n$, is a random sample of $X$, then (2.142) can be written as a
product

$$L = \prod_{i=1}^{n} f_X(x_i; \theta_1, \theta_2, \ldots, \theta_m). \tag{2.143}$$
In this case it is usually easier to maximize $\log L$ (the log likelihood function) rather than
the likelihood function $L$ itself. If $L$ is sufficiently smooth, then this may be done by
solving the simultaneous equations

$$\frac{\partial \log L}{\partial \theta_i} = 0, \quad 1 \le i \le m. \tag{2.144}$$

As for the method of moments, (2.144) usually must be solved numerically and there
may be more than one solution. The resulting estimators are called maximum likelihood
estimators (MLE) and have many desirable properties. Hence, they are perhaps the most
frequently used estimators in statistical analysis. As we show in Chapters 3 and 5, much
of regression analysis is based on this estimation technique. Several examples illustrate
this approach.


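Example 2.18 Suppose $X$ is a Bernoulli random variable with density given by
(2.18). To find the MLE of $\theta$ based on a random sample of size $n$ we first observe that
$f_X(x)$ can be written as

$$f_X(x; \theta) = \theta^x (1 - \theta)^{1 - x}, \quad x = 0, 1, \; 0 \le \theta \le 1. \tag{2.145}$$

From (2.143) $L$ can be written as

$$L = \prod_{i=1}^{n} f_X(x_i; \theta) = \theta^{\sum_{i=1}^{n} x_i} (1 - \theta)^{n - \sum_{i=1}^{n} x_i}. \tag{2.146}$$

If $\sum_{i=1}^{n} x_i = 0$, then $L = (1 - \theta)^n$, which is maximized for $\hat{\theta} = 0$, while if $\sum_{i=1}^{n} x_i = n$,
$L = \theta^n$, which is maximized for $\hat{\theta} = 1$. If $0 < \sum_{i=1}^{n} x_i < n$, then (2.146) can be maximized
by solving
$$\frac{\partial \log L}{\partial \theta} = 0. \tag{2.147}$$

From (2.146)

$$\log L = \left(\sum_{i=1}^{n} x_i\right) \log \theta + \left(n - \sum_{i=1}^{n} x_i\right) \log (1 - \theta) \tag{2.148}$$

so that
$$\frac{\partial \log L}{\partial \theta} = \frac{1}{\theta} \sum_{i=1}^{n} x_i - \frac{1}{1 - \theta} \left(n - \sum_{i=1}^{n} x_i\right). \tag{2.149}$$
Setting $\partial \log L / \partial \theta = 0$ and solving for $\theta$ gives
$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} x_i = \bar{x}_n. \tag{2.150}$$
Using the second derivative test one can show that $\hat{\theta}$ in fact maximizes $L$. Hence, in all
cases $\hat{\theta} = \bar{x}_n$, the sample mean, which is the same as the moment estimator.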
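
As a numerical illustration of this example, the Bernoulli log likelihood (2.148) can be maximized by brute force over a grid of $\theta$ values and compared with the sample mean. The sample below is hypothetical and the grid search is only a sketch, not a recommended optimization method.

```python
from math import log

def log_likelihood(theta, xs):
    """Bernoulli log likelihood (2.148) for 0 < theta < 1."""
    s, n = sum(xs), len(xs)
    return s * log(theta) + (n - s) * log(1.0 - theta)

xs = [1, 0, 0, 1, 1, 0, 1, 1, 0, 1]           # hypothetical sample of size 10
grid = [k / 1000 for k in range(1, 1000)]     # theta = 0.001, ..., 0.999
theta_grid = max(grid, key=lambda t: log_likelihood(t, xs))
sample_mean = sum(xs) / len(xs)

print(theta_grid, sample_mean)
```

The grid maximizer coincides with the sample mean, as (2.150) predicts.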


Example 2.19 Let $X$ be a $N(\mu, \sigma^2)$ random variable. We find the MLE of $(\mu, \sigma^2)$
in the following way. If $X_i$, $1 \le i \le n$, is a random sample of $X$, then using (2.99)

$$L = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi} \, \sigma} \exp\left[-(x_i - \mu)^2 / 2\sigma^2\right] = (2\pi\sigma^2)^{-n/2} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2\right] \tag{2.151}$$
so that
$$\log L = -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2. \tag{2.152}$$

As before, we can maximize $\log L$ using calculus by setting $\partial \log L / \partial \mu = \partial \log L / \partial \sigma = 0$.
Doing this gives

$$\sum_{i=1}^{n} (x_i - \mu) = 0 \tag{2.153}$$
and
$$-\frac{n}{\sigma} + \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{\sigma^3} = 0. \tag{2.154}$$
From (2.153) $\hat{\mu} = \bar{x}_n$, and using this in (2.154) gives
$$\hat{\sigma} = \left[\frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x}_n)^2\right]^{1/2}. \tag{2.155}$$

Notice in (2.155) that the MLEs agree with the corresponding moment estimators.
This is not always the case. For example, suppose $X$ is a uniform random variable
with a density

$$f_X(x) = 1/\theta, \quad 0 \le x \le \theta, \tag{2.156}$$

then the method of moments estimator (MME) of $\theta$ is given by
$$\hat{\theta}_{MME} = 2\bar{X}_n, \tag{2.157}$$
while the MLE is given by
$$\hat{\theta}_{MLE} = \max(X_1, X_2, \ldots, X_n). \tag{2.158}$$

Generally, one prefers $\hat{\theta}_{MLE}$ to $\hat{\theta}_{MME}$ since it has smaller variance; hence it is
more efficient (see Section 2.10 for the definition). The efficiency of MLEs is one of the
properties that makes them so desirable in practice.

2.9.3 Least Squares Estimation


In least squares estimation we assume that $X_i$, $1 \le i \le n$, are $n$ random variables with
$$E(X_i) = f_i(\theta_1, \theta_2, \ldots, \theta_m). \tag{2.159}$$

If $x_i$, $1 \le i \le n$, are the observed values of $X_i$, then the least squares estimates of
$(\theta_1, \theta_2, \ldots, \theta_m)$ are obtained by minimizing

$$Q = \sum_{i=1}^{n} \left[x_i - f_i(\theta_1, \theta_2, \ldots, \theta_m)\right]^2 \tag{2.160}$$

with respect to $(\theta_1, \theta_2, \ldots, \theta_m)$. If the $f_i$, $1 \le i \le n$, are sufficiently smooth, then this may
be done by calculus by solving

$$\partial Q / \partial \theta_i = 0, \quad 1 \le i \le m. \tag{2.161}$$


As for maximum likelihood estimation, this usually must be done numerically. As we 
shall see, in classical regression analysis MLEs and least squares estimators (LSEs) are 
often the same. 
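
A minimal sketch of (2.160)-(2.161) for a single parameter: assume, purely for illustration, that $f_i(\theta) = \theta t_i$ for known constants $t_i$ (a line through the origin). Then $\partial Q / \partial \theta = -2 \sum_i t_i (x_i - \theta t_i) = 0$ gives $\hat{\theta} = \sum_i t_i x_i / \sum_i t_i^2$. The data values below are hypothetical.

```python
def q(theta, ts, xs):
    """Sum of squares (2.160) for the assumed model f_i(theta) = theta * t_i."""
    return sum((x - theta * t) ** 2 for t, x in zip(ts, xs))

ts = [1.0, 2.0, 3.0, 4.0]     # hypothetical known constants
xs = [2.1, 3.9, 6.2, 7.8]     # hypothetical observations

# Closed-form solution of dQ/dtheta = 0 for this model.
theta_hat = sum(t * x for t, x in zip(ts, xs)) / sum(t * t for t in ts)

print(theta_hat, q(theta_hat, ts, xs))
```

Evaluating `q` slightly to either side of `theta_hat` gives a larger sum of squares, which is a quick sanity check that the stationary point is a minimum.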


2.9.4 Bayesian Estimation 


In the method of moments and maximum likelihood estimation the unknown parameter(s)
$\theta = (\theta_1, \theta_2, \ldots, \theta_m)$ is assumed to be a fixed number subject only to the condition
that $\theta \in \Theta$, the prescribed parameter space. In these forms of estimation all possible
values of $\theta$ are treated on the same footing before estimates are made. However, in many
situations it is reasonable to assume on the basis of past experience that some values of
$\theta$ are more likely to occur than others. For example, in tossing a coin, which is described
by a Bernoulli random variable with $\theta = P\{X = 1\}$ ($\theta = P\{\text{Heads occurs}\}$), it is reasonable
to assume that values of $\theta$ near one-half are more likely to occur than those near zero or one.
In this situation we can think of $\theta$ as a random variable and then $f_X(x; \theta)$ can be interpreted
as the conditional density of $X$ given $\theta$. If $\theta$ has a density $f(\theta)$, then

$$P\{X = x\} = \int_{\Theta} f_X(x; \theta) f(\theta) \, d\theta \tag{2.162}$$
for a discrete random variable and
$$P\{x \le X \le x + \Delta x\} \approx \Delta x \int_{\Theta} f_X(x; \theta) f(\theta) \, d\theta \tag{2.163}$$

for small $\Delta x$ if $X$ is continuous.

In Bayesian estimation observations are used to modify the prior distribution of $\theta$
via Bayes rule. Characteristics of this posterior distribution are then used to estimate
$\theta$. As the theory can become quite complex, we will only illustrate the rudiments of the
approach.

If $X_i$, $1 \le i \le n$, is a random sample of $X$, Bayes rule yields the posterior density
(discrete or continuous)

$$f_{\theta|X}(\theta | x_1, x_2, \ldots, x_n) = \frac{\left[\prod_{i=1}^{n} f_X(x_i; \theta)\right] f(\theta)}{\int_{\Theta} \left[\prod_{i=1}^{n} f_X(x_i; \theta)\right] f(\theta) \, d\theta}. \tag{2.164}$$
To estimate $\theta$ one now chooses some characteristic of $f_{\theta|X}(\theta | x_1, x_2, \ldots, x_n)$ such as
its mean, mode (most likely value), median, etc. Choosing the mode, for example (if
it exists), yields an estimate of $\theta$ as the "most likely" value of $\theta$ given the observations
(MLE). Justification for using the mean or median can be given in terms of statistical
decision theory. The mean of the posterior distribution is the most commonly used
estimate.


Example 2.20 Again we consider the problem of estimating the parameter of a
Bernoulli distribution. Assume now that $\theta$ is a random variable and $R(\theta) = [0,1]$. To
apply (2.164) requires the choice of a prior density $f_\Theta(\theta)$. Suppose we are completely
ignorant of the possible values of $\theta$, i.e., all values of $\theta$ are assumed equally likely. In this
case we choose $\theta$ to be uniform on $[0, 1]$. The posterior distribution is then

$$f_{\Theta|X}(\theta \mid x_1, x_2, \ldots, x_n) = \frac{\theta^{\sum_{i=1}^n x_i}\,(1-\theta)^{n-\sum_{i=1}^n x_i}}{\int_0^1 \theta^{\sum_{i=1}^n x_i}\,(1-\theta)^{n-\sum_{i=1}^n x_i}\, d\theta} \tag{2.165}$$

where the denominator in (2.165) is a beta function. Since the denominator is independent
of $\theta$, choosing $\hat\theta$ as the mode of the posterior distribution yields the maximum
likelihood estimator $\bar{X}_n$.

If we use $\hat\theta = E(\theta \mid X)$, the posterior mean, then using properties of the beta function
gives

$$\hat\theta = E(\theta \mid X) = \frac{\sum_{i=1}^n X_i + 1}{n + 2}, \tag{2.166}$$

which differs from $\bar{X}_n$. For example, if $\sum_{i=1}^n x_i = 0$ then $\hat\theta = 1/(n + 2)$ and not zero.
For small values of $n$ the Bayes estimator tends to pull in extreme values of $\bar{X}_n$ towards
the center of $[0,1]$. This is sensible since for small sample sizes there is a reasonable
probability of getting strings of all zeros or ones. For large values of $n$ the Bayes estimator
$\hat\theta$ and $\bar{X}_n$ differ by little.
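
The pull towards the center of $[0,1]$ can be seen numerically. The following is a minimal
sketch, not part of the original text; the function name `bernoulli_estimates` is a
hypothetical helper, and exact fractions are used only to keep the arithmetic transparent.

```python
from fractions import Fraction

def bernoulli_estimates(xs):
    """Return (MLE, posterior mean) for 0/1 data under a uniform prior on theta."""
    n, s = len(xs), sum(xs)
    mle = Fraction(s, n)            # sample mean, which is also the posterior mode
    bayes = Fraction(s + 1, n + 2)  # posterior mean, Eq. (2.166)
    return mle, bayes

# A short run of all zeros: the MLE is 0, the Bayes estimate is 1/(n + 2).
print(bernoulli_estimates([0, 0, 0]))
# A long sample: the two estimates are close.
print(bernoulli_estimates([1, 0] * 500))
```

For the all-zero sample of size three the Bayes estimate is $1/5$ rather than zero, in
agreement with the remark following (2.166).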

If complete ignorance does not prevail, then we will feel that some values of $\theta$ are
more likely than others and we are faced with translating our subjective feeling into
quantitative statements about a prior distribution for $\theta$. A typical choice is to assume
that $\theta$ has a beta density with parameters $\alpha$ and $\beta$ (this makes calculations easy to do)
so that the posterior density is

$$f_{\Theta|X}(\theta \mid \mathbf{x}) = \frac{\theta^{\sum x_i}\,(1-\theta)^{n-\sum x_i}\;\theta^{\alpha-1}(1-\theta)^{\beta-1}}{B\left(\alpha + \sum x_i,\; \beta + n - \sum x_i\right)}. \tag{2.167}$$

If one chooses $\hat\theta$ as the mode of (2.167), then if $\alpha > 1$, $\beta > 1$,

$$\hat\theta = \frac{\alpha + \sum_{i=1}^n X_i - 1}{\alpha + \beta + n - 2} \tag{2.168}$$

whereas using the posterior mean yields

$$\hat\theta = \frac{\alpha + \sum_{i=1}^n X_i}{\alpha + \beta + n}. \tag{2.169}$$

For large enough $n$ all of these estimators will differ little from $\bar{X}_n$.


Example 2.21 Let $X$ be a $N(\theta, \sigma^2)$ random variable, and suppose that the prior
distribution of $\theta$ is $N(\eta, \tau^2)$. Assuming that $\sigma^2$, $\eta$, and $\tau^2$ are known, then the posterior
distribution of $\theta$ given a random sample of size $n$ with sample mean $\bar{x}_n$ is also normal,
with mean and variance given by (after some algebra)

$$E(\theta \mid \mathbf{x}) = \frac{n\tau^2\, \bar{x}_n + \sigma^2\, \eta}{n\tau^2 + \sigma^2}, \tag{2.170}$$

$$\mathrm{Var}(\theta \mid \mathbf{x}) = \frac{\sigma^2 \tau^2}{n\tau^2 + \sigma^2}. \tag{2.171}$$

Thus the posterior mean, which is the Bayes estimator, is a linear combination of the
prior mean and the sample mean.

It is worth noting that if we allow $\tau^2$, the prior variance, to tend to infinity, the Bayes
estimator tends towards the sample mean. This implies that as the prior information
becomes more vague, the Bayes estimator tends to give more weight to the sample
information.
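
The limiting behaviour described above can be checked with a few lines of code. This is
a sketch under an explicit assumption: it uses the standard conjugate-normal result for
a sample of size $n$, namely posterior mean $(n\tau^2\bar{x}_n + \sigma^2\eta)/(n\tau^2 + \sigma^2)$ and posterior
variance $\sigma^2\tau^2/(n\tau^2 + \sigma^2)$. The function name and the numbers are illustrative only.

```python
def normal_posterior(xbar, n, sigma2, eta, tau2):
    """Posterior mean and variance of theta for N(theta, sigma2) data
    with a N(eta, tau2) prior (standard conjugate result, assumed here)."""
    denom = n * tau2 + sigma2
    mean = (n * tau2 * xbar + sigma2 * eta) / denom  # weighted average of xbar and eta
    var = sigma2 * tau2 / denom
    return mean, var

# Hypothetical numbers: sample mean 2.0 from n = 10 observations, prior mean 0.
for tau2 in (0.1, 1.0, 100.0, 1e6):
    print(tau2, normal_posterior(xbar=2.0, n=10, sigma2=1.0, eta=0.0, tau2=tau2))
```

As `tau2` grows the printed posterior mean approaches the sample mean 2.0 and the
posterior variance approaches $\sigma^2/n$.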


2.10 Properties of Estimators 


2.10.1 Consistency and Unbiasedness 


Having presented several techniques for making parameter estimates and observing that 
different methods may produce different estimators, we now turn our attention to a 
discussion of some of the more important criteria which are used to evaluate competing 
ones. 

Since estimators are random variables, we can examine their distributions for clues to
their characteristics. Since the purpose of sampling is to determine the true value of $\theta$, it
is reasonable to assume that as the sample size $n$ increases the distribution of $\hat\theta_n$ clusters
more about the true value of $\theta$ for all $\theta \in \Theta$. This may be expressed in probability terms as

$$\lim_{n \to \infty} P\left\{ |\hat\theta_n - \theta| < \epsilon \right\} = 1 \tag{2.172}$$

for any $\epsilon > 0$. (This is usually called convergence in probability.) If (2.172) holds, we
shall call $\{\hat\theta_n\}$ a consistent sequence of estimators for $\theta$.

Since one can rarely compute the exact distribution of $\hat\theta_n$, we need a criterion which
can be easily used to establish consistency. It can be shown that sufficient conditions for
(2.172) to hold are that

$$\lim_{n \to \infty} E(\hat\theta_n) = \theta \quad \text{and} \quad \lim_{n \to \infty} \mathrm{Var}(\hat\theta_n) = 0. \tag{2.173}$$

A proof of this can be found in [40, 89].

One should note that if $E(\hat\theta_n) = \mu_n = \theta$ for all $n$ ($\hat\theta_n$ is then unbiased) then it is
sufficient that $\mathrm{Var}(\hat\theta_n) \to 0$ in order that the sequence $\hat\theta_n$ be consistent.
In general, the quantity $\theta - \mu_n$ is called the bias of the estimator and

$$E(\hat\theta_n - \theta)^2 = \mathrm{Var}(\hat\theta_n) + (\theta - \mu_n)^2 \tag{2.174}$$

is called the mean square error (MSE) of $\hat\theta_n$. Thus an equivalent way of stating (2.173)
is to say that $\{\hat\theta_n\}$ is consistent if the mean square error converges to zero. Since the
rate of convergence of $\hat\theta_n$ to $\theta$ depends on the mean square error, it is appropriate to look
for estimates which minimize it. This can be done, for example, by choosing $\hat\theta_n$ to be
unbiased and then picking $\mathrm{Var}(\hat\theta_n)$ as small as possible. Such an estimator, if it exists,
is called a minimum variance unbiased estimator (MVUE) of $\theta$. If such an estimator $\hat\theta_n$
exists, then we can define the efficiency of any other unbiased estimator $\hat\psi_n$ by

$$\text{Efficiency} = \mathrm{Var}(\hat\theta_n) / \mathrm{Var}(\hat\psi_n). \tag{2.175}$$


More generally, we can say for any two unbiased estimators $\hat\theta$ and $\hat\psi$ that $\hat\theta$ is more
efficient than $\hat\psi$ if $\mathrm{Var}(\hat\theta) < \mathrm{Var}(\hat\psi)$.

These observations lead to an additional principle of estimation, that of minimizing
the MSE, or the variance if we want the estimate to be unbiased. In general, it is quite
difficult to find minimum variance estimators and the general theory is beyond the scope
of this text. (However, we will consider some special cases in Chapters 3 and 5.)
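
The decomposition (2.174) can be verified exactly for the Bernoulli estimators of
Example 2.20. The sketch below is illustrative only: `exact_moments` is a hypothetical
helper that sums over the binomial distribution of the number of successes, and the
values $n = 10$, $\theta = 0.3$ are arbitrary.

```python
from math import comb

def exact_moments(estimator, n, theta):
    """Exact bias, variance and MSE of estimator(s, n) when s ~ Binomial(n, theta)."""
    probs = [comb(n, s) * theta**s * (1 - theta) ** (n - s) for s in range(n + 1)]
    values = [estimator(s, n) for s in range(n + 1)]
    mean = sum(p * v for p, v in zip(probs, values))
    variance = sum(p * (v - mean) ** 2 for p, v in zip(probs, values))
    mse = sum(p * (v - theta) ** 2 for p, v in zip(probs, values))
    return theta - mean, variance, mse  # bias written as theta - mu_n, as in the text

def sample_mean(s, n):
    return s / n

def bayes_uniform(s, n):
    return (s + 1) / (n + 2)  # posterior mean under a uniform prior, Eq. (2.166)

for name, est in [("sample mean", sample_mean), ("Bayes, uniform prior", bayes_uniform)]:
    bias, variance, mse = exact_moments(est, n=10, theta=0.3)
    print(name, bias, variance, mse, variance + bias**2)
```

The last two printed columns agree for each estimator. The sample mean is unbiased,
while the Bayes estimator trades a small bias for a smaller variance.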


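Example 2.22 Let $X$ be a random variable with density $f(x;\theta)$ such that $E(X) = \theta$.
If $\mathrm{Var}(X) < \infty$, then $\bar{X}_n = \sum_{i=1}^n X_i / n$ is a consistent sequence of estimators for $\theta$ if
$X_i$, $1 \le i \le n$, is a random sample of $X$.

Using the linearity of expectation,

$$E(\bar{X}_n) = \frac{1}{n} \sum_{i=1}^n E(X_i) = \frac{n\theta}{n} = \theta \tag{2.176}$$

so $\bar{X}_n$ is unbiased. From (2.73)

$$\mathrm{Var}(\bar{X}_n) = \frac{1}{n^2} \sum_{i=1}^n \mathrm{Var}(X_i) = \frac{1}{n^2}\left[ n \mathrm{Var}(X) \right] = \frac{1}{n} \mathrm{Var}(X) \to 0, \quad n \to \infty. \tag{2.177}$$

As particular cases, $\bar{X}_n$ is a consistent estimator of the mean $\mu$ of a $N(\mu, \sigma^2)$ random
variable and for the parameter $\theta$ of a Bernoulli random variable.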


Example 2.23 Show that the sample variance $\hat\sigma_n^2 = \sum_{i=1}^n (X_i - \bar{X}_n)^2 / n$ is a
consistent sequence of estimators for the variance of a normal random variable.

From (2.113) we know that $\sum_{i=1}^n (X_i - \bar{X}_n)^2 / \sigma^2$ is a $\chi^2(n-1)$ random variable so
that

$$E(\hat\sigma_n^2) = (n - 1)\sigma^2 / n \tag{2.178}$$

and

$$\mathrm{Var}(\hat\sigma_n^2) = 2\sigma^4 (n - 1) / n^2. \tag{2.179}$$

From (2.178) and (2.179)

$$\lim_{n \to \infty} E(\hat\sigma_n^2) = \sigma^2 \quad \text{and} \quad \lim_{n \to \infty} \mathrm{Var}(\hat\sigma_n^2) = 0 \tag{2.180}$$

and (2.180) shows that $\{\hat\sigma_n^2\}$ is a consistent sequence of estimators for $\sigma^2$.


2.10.2 Sufficiency


Although consistency is a basic property which good estimators should have, it is a 
large sample property, and it is also important to have estimators which exhibit good 
properties for small values of n. One of these is unbiasedness, another one is sufficiency. 

To illustrate the concept of sufficiency we again consider the problem of estimating
$\theta$, the probability of success on a single Bernoulli trial. If $X_i$, $i = 1, 2, \ldots, n$, is a random
sample of size $n$, then we can ask what is the best way to summarize the sample data so
that no information about estimating $\theta$ is lost. A useful way of looking at this problem
is to consider what happens if we want to store this information in a computer. Assume
that all numbers are stored in binary so that recording all the outcomes gives rise to the
binary number $\epsilon_1 \epsilon_2 \cdots \epsilon_n$ where $\epsilon_i$ is 0 or 1. Thus storing all the information about the
experiment requires $n$ "bits". In Section 2.9 we have observed that $\bar{X}_n$ is in many senses
a good estimator for $\theta$ and to compute $\bar{X}_n$ we need only store the number of successes
$\sum_{i=1}^n X_i$. This requires on the order of $\log_2 n$ bits, rather than $n$.

For example, suppose we toss a coin 10 times and observe four H's and six T's. Then
storing $(x_1, x_2, \ldots, x_{10})$ requires 10 bits. Now $\sum_{i=1}^{10} x_i = 4$ and $4 = 100$ in binary and so
its storage requires only three bits. Even if $\sum_{i=1}^{10} x_i = 10$ then $10 = 1010$ (in binary) and
we would need at most four bits. Since $\lim_{n \to \infty} \log_2 n / n = 0$ the information compression
given by $\sum_{i=1}^n X_i$ becomes arbitrarily large. In fact, it turns out that we cannot do any
better, in the sense that $\sum_{i=1}^n X_i$ summarizes all the information about the experiment
for the purpose of estimating $\theta$. Nothing further about $\{X_i\}$ such as the order of the
outcomes needs to be used to estimate $\theta$. In this case we say that $\sum_{i=1}^n X_i$ is sufficient
for $\theta$.
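
The bit counts in the coin example can be reproduced directly. The following sketch is
illustrative only; the particular sequence of outcomes and the helper `bits_needed` are
assumptions made for the example.

```python
def bits_needed(count):
    """Number of binary digits needed to write a non-negative integer."""
    return max(1, count.bit_length())

# One possible record of 10 tosses with four heads (1 = head, 0 = tail).
outcomes = [1, 0, 0, 1, 0, 1, 0, 0, 1, 0]

print(len(outcomes))               # bits needed to store every outcome
print(bits_needed(sum(outcomes)))  # bits needed to store the number of heads
print(bits_needed(10))             # worst case for n = 10: all heads

# The ratio (bits for the count) / n shrinks as n grows.
for n in (10, 1000, 10**6):
    print(n, bits_needed(n) / n)
```

Storing the count instead of the full record loses nothing for the purpose of estimating
$\theta$, which is the point of sufficiency.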

Establishing sufficiency of an estimator can be quite difficult. However, there is a
famous theorem, the factorization theorem, which often simplifies matters considerably
[63, 89].


Theorem 2.1 (Factorization Theorem) Let $X_1, X_2, \ldots, X_n$ be a random sample of
size $n$ from the density $f(x;\theta)$. Then $\hat\theta(X_1, X_2, \ldots, X_n)$ is a sufficient statistic for $\theta$ if
and only if the joint density $\prod_{i=1}^n f_X(x_i;\theta) = f_X(x_1, x_2, \ldots, x_n;\theta)$ factors as follows:

$$f_X(x_1, x_2, \ldots, x_n;\theta) = g\left[\hat\theta(x_1, x_2, \ldots, x_n);\, \theta\right] h(x_1, x_2, \ldots, x_n) \tag{2.181}$$

where $h$ does not depend on $\theta$.
Proof. See [63, 89]. □


Since a sufficient statistic summarizes all the sample information for estimating $\theta$, it
is often advisable to begin the search for a good estimator by checking to see if a sufficient
statistic exists. Since a monotone function of a sufficient statistic is again sufficient, one
can then modify a given sufficient statistic to have other properties such as unbiasedness.


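Example 2.24 Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a $N(\theta, 1)$
population. Then $\bar{X}_n = n^{-1} \sum_{i=1}^n X_i$ is sufficient for $\theta$, where $X_i$, $1 \le i \le n$, is a random
sample of $X$.

To show this we use (2.181). Here,

$$f_X(x_1, x_2, \ldots, x_n;\theta) = \frac{1}{(2\pi)^{n/2}} \exp\left[ -\frac{1}{2} \sum_{i=1}^n (x_i - \theta)^2 \right] \tag{2.182}$$

and using $\sum_{i=1}^n (x_i - \theta)^2 = \sum_{i=1}^n (x_i - \bar{x}_n)^2 + n(\bar{x}_n - \theta)^2$, we find that the right hand
side of (2.182) becomes

$$\frac{1}{(2\pi)^{n/2}} \exp\left[ -\frac{1}{2} \sum_{i=1}^n (x_i - \bar{x}_n)^2 \right] \exp\left[ -\frac{n}{2} (\bar{x}_n - \theta)^2 \right]. \tag{2.183}$$

Letting $h(x_1, x_2, \ldots, x_n) = \exp\left[ -\sum_{i=1}^n (x_i - \bar{x}_n)^2 / 2 \right]$ and $g(\bar{x}_n;\theta) = (2\pi)^{-n/2}
\exp\left[ -n(\bar{x}_n - \theta)^2 / 2 \right]$ in (2.183) shows that $\bar{X}_n$ is a sufficient statistic for $\theta$.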
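
The algebraic identity behind this factorization is easy to check numerically. The sketch
below uses made-up data and a hypothetical helper name; it confirms that
$\sum (x_i - \theta)^2$ splits into a part free of $\theta$ and a part that depends on the data only
through $\bar{x}_n$.

```python
def split_sum_of_squares(xs, theta):
    """Return (total, within, between) for the decomposition used in Example 2.24."""
    n = len(xs)
    xbar = sum(xs) / n
    total = sum((x - theta) ** 2 for x in xs)    # sum of (x_i - theta)^2
    within = sum((x - xbar) ** 2 for x in xs)    # does not involve theta
    between = n * (xbar - theta) ** 2            # involves the data only via xbar
    return total, within, between

data = [0.3, 1.7, 2.2, 0.9, 1.4]  # illustrative values only
for theta in (0.0, 1.0, 2.5):
    total, within, between = split_sum_of_squares(data, theta)
    print(theta, total, within + between)
```

For every value of `theta` the two printed sums agree, and `within` is the same each time.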


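The factorization criterion can be extended to the problem of jointly estimating $m$
parameters $(\theta_1, \theta_2, \ldots, \theta_m)$.


Definition 2.3 Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the density
$f_X(x;\theta_1, \theta_2, \ldots, \theta_m)$ that depends on $m$ parameters. Then the statistics $\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_m$
are jointly sufficient for $(\theta_1, \theta_2, \ldots, \theta_m)$ if and only if the joint density of a random sample
of $X$ factors as

$$f_X(x_1, \ldots, x_n;\theta_1, \ldots, \theta_m) = g\left[\hat\theta_1(\mathbf{x}), \ldots, \hat\theta_m(\mathbf{x});\, \theta_1, \ldots, \theta_m\right] h(x_1, \ldots, x_n) \tag{2.184}$$

where $\mathbf{x} = (x_1, x_2, \ldots, x_n)$.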


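Example 2.25 Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a $N(\mu, \sigma^2)$.
Then $\bar{X}_n$ and $\sum_{i=1}^n (X_i - \bar{X}_n)^2$ are jointly sufficient for estimating $(\mu, \sigma^2)$. This follows
from (2.184) since the joint density is

$$f_X(x_1, x_2, \ldots, x_n;\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[ -\frac{1}{2\sigma^2} \sum_{i=1}^n (x_i - \mu)^2 \right]. \tag{2.185}$$

Since $\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar{x}_n)^2 + n(\bar{x}_n - \mu)^2$, we see from this that

$$f_X(x_1, x_2, \ldots, x_n;\mu, \sigma^2) = g\left[ \bar{x}_n, \sum_{i=1}^n (x_i - \bar{x}_n)^2;\, \mu, \sigma^2 \right]. \tag{2.186}$$

Thus (2.184) is satisfied by taking $h(x_1, x_2, \ldots, x_n) = 1$.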


Since a sufficient statistic summarizes all the data for $\theta$ one can expect that "best"
estimators for $\theta$ should be a function of a sufficient statistic. In fact it can be shown that
if $\hat\theta$ is an unbiased estimator for $\theta$ then $E(\hat\theta \mid \hat\theta_s)$ is also an unbiased estimator, where $\hat\theta_s$
is sufficient, and has variance no greater than $\mathrm{Var}(\hat\theta)$. Thus in order to find minimum
variance estimators it suffices to look for those which are functions of sufficient statistics.
For further results along these lines the reader is referred to Ref. [74].


2.11 Confidence Intervals


2.11.1 Exact Confidence Intervals 


Up to this point we have examined methods for finding point estimates of parameters for
a given random variable. However, as in non-statistical calculations, when an approximation
is made it is important to have some estimate of the error in the calculation.
For example, if we wish to estimate some real number $x$ by the number $\hat{x}$ and $\Delta x$ is the
absolute error then we will have

$$\hat{x} - \Delta x \le x \le \hat{x} + \Delta x. \tag{2.187}$$

That is, the true value $x$ lies in the interval $[\hat{x} - \Delta x, \hat{x} + \Delta x]$. In statistical estimation
if $\theta$ is the parameter of a random variable and $\hat\theta$ is an estimator, then the interval
$[\hat\theta - \Delta\theta, \hat\theta + \Delta\theta]$ has random end points and so we cannot make error statements with
probability one, but only with a given level of confidence $1 - \alpha$. Moreover, we need to make
probability statements which do not depend on the unknown parameter $\theta$. Formalizing
these considerations leads us to the notion of a confidence interval (or interval estimate)
for a parameter $\theta$.


Definition 2.4 Let $X$ be a random variable whose distribution depends on a parameter
$\theta$. A confidence interval for $\theta$ with confidence at least $(1 - \alpha) \times 100\%$ is a pair of
statistics $(\hat\theta_L, \hat\theta_U)$, $\hat\theta_L < \hat\theta_U$, such that for all $\theta \in \Theta$

$$P\{\hat\theta_L \le \theta \le \hat\theta_U\} \ge 1 - \alpha. \tag{2.188}$$

If the inequality in (2.188) can be taken to be an equality, we say that $(\hat\theta_L, \hat\theta_U)$ is an
exact $(1 - \alpha) \times 100\%$ confidence interval for $\theta$.

For continuous random variables one can often find exact confidence intervals, whereas 
for discrete random variables one can usually only satisfy (2.188) with the inequality. 

A standard technique for finding confidence intervals is based on the notion of a
pivotal quantity $Q$ where $Q$ is a function of a random sample $(X_1, X_2, \ldots, X_n)$ of $X$
whose distribution does not depend on the unknown parameters in the distribution or
density of $X$. $Q$ itself is not a statistic since it will usually be a function of the unknown
parameters. To illustrate these ideas consider the problem of finding a confidence interval
for the mean $\mu$ of a $N(\mu, \sigma^2)$ random variable and suppose for the time being that $\sigma$
is known. Then from previous considerations we base our estimation of $\mu$ on $\bar{X}_n$. Now,
$Q_n = \sqrt{n}(\bar{X}_n - \mu)/\sigma$ is $N(0,1)$ and so can serve as a pivotal quantity. Thus we can
determine $(a, b)$ independent of $(\mu, \sigma^2)$ such that

$$P\{a < Q_n < b\} = 1 - \alpha. \tag{2.189}$$

Manipulating the inequality in (2.189) gives

$$P\{\bar{X}_n - b\sigma/\sqrt{n} < \mu < \bar{X}_n - a\sigma/\sqrt{n}\} = 1 - \alpha \tag{2.190}$$

so that $(\bar{X}_n - b\sigma/\sqrt{n},\, \bar{X}_n - a\sigma/\sqrt{n})$ is a $(1 - \alpha) \times 100\%$ confidence interval for $\mu$. Since
$a$ and $b$ are not unique, we have in fact found infinitely many confidence intervals for $\mu$.
To make $\hat\theta_U - \hat\theta_L$ as "short" as possible $a$ and $b$ should be chosen to minimize $E(\hat\theta_U - \hat\theta_L)$
and this is easily shown to result in the choice $a = -b$. Thus,

$$P\{-b < Q_n < b\} = 1 - \alpha, \tag{2.191}$$

and this gives $b = z_{\alpha/2}$ where $P\{Q_n > z_{\alpha/2}\} = \alpha/2$ and $z_{\alpha/2}$ is the $[(1 - \alpha/2) \times 100\%]$-th
percentage point of a $N(0,1)$ random variable. This gives the confidence interval as

$$\left( \bar{X}_n - z_{\alpha/2}\,\sigma/\sqrt{n},\; \bar{X}_n + z_{\alpha/2}\,\sigma/\sqrt{n} \right). \tag{2.192}$$
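
A minimal sketch of the interval (2.192) follows. It is not from the original text: the
function name and the numerical inputs are illustrative, and the percentage point
$z_{\alpha/2}$ is obtained from `statistics.NormalDist` in the Python standard library.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, alpha=0.05):
    """Two-sided (1 - alpha) confidence interval for a normal mean, sigma known."""
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    half_width = z * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

# Hypothetical data summary: sample mean 10 from n = 25 observations, sigma = 2.
print(z_interval(xbar=10.0, sigma=2.0, n=25, alpha=0.05))
```

Quadrupling the sample size halves the width of the interval, since the width is
proportional to $1/\sqrt{n}$.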


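Example 2.26 (Confidence interval for $\sigma^2$) Let $X_1, X_2, \ldots, X_n$ be a random sample of
size $n$ from a $N(\mu, \sigma^2)$. Find a $(1 - \alpha) \times 100\%$ confidence interval for $\sigma^2$ if $\mu$ is unknown.
By (2.113) we know that

$$\chi^2 = \frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \bar{X}_n)^2 \tag{2.193}$$

is a $\chi^2(n-1)$ random variable. Since the density of $\chi^2(n-1)$ does not depend on $\sigma^2$,
we can use $\chi^2$ as a pivotal quantity. Thus we choose $(a, b)$ so that

$$P\{a < \chi^2 < b\} = 1 - \alpha. \tag{2.194}$$

One solution to (2.194) (the customary one) is to determine $b$ by $P\{\chi^2(n-1) \ge b\} =
\alpha/2$ and $a$ by $P\{\chi^2(n-1) \le a\} = \alpha/2$. If we let $\chi^2_{n-1,\alpha/2} = b$ and $\chi^2_{n-1,1-\alpha/2} = a$ then

$$P\left\{ \chi^2_{n-1,1-\alpha/2} \le \frac{1}{\sigma^2} \sum_{i=1}^n (X_i - \bar{X}_n)^2 \le \chi^2_{n-1,\alpha/2} \right\} = 1 - \alpha. \tag{2.195}$$

After a little manipulation, (2.195) gives

$$\left( \sum_{i=1}^n (X_i - \bar{X}_n)^2 \Big/ \chi^2_{n-1,\alpha/2},\;\; \sum_{i=1}^n (X_i - \bar{X}_n)^2 \Big/ \chi^2_{n-1,1-\alpha/2} \right) \tag{2.196}$$

as a $(1 - \alpha) \times 100\%$ confidence interval for $\sigma^2$.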
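
The computation in (2.196) can be sketched as follows. To stay within the Python
standard library the chi-square percentage points are passed in by the user; the values
19.023 and 2.700 used below are the usual tabulated points for 9 degrees of freedom and
$\alpha = 0.05$. The data and the function name are illustrative assumptions.

```python
def variance_interval(xs, chi2_upper, chi2_lower):
    """Confidence interval (2.196) for sigma^2, given the chi-square points
    chi2_upper = chi^2_{n-1, alpha/2} and chi2_lower = chi^2_{n-1, 1-alpha/2}."""
    n = len(xs)
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)  # sum of squared deviations
    return ss / chi2_upper, ss / chi2_lower

# Illustrative sample of size n = 10, so n - 1 = 9 degrees of freedom.
sample = [4.1, 5.2, 6.3, 4.8, 5.5, 5.9, 4.4, 5.1, 6.0, 4.7]
print(variance_interval(sample, chi2_upper=19.023, chi2_lower=2.700))
```

Note that the larger percentage point produces the lower end point, and that the
interval is not symmetric about the sample variance.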


We should point out that confidence intervals derived under the assumption of nor- 
mality are often used even if this condition is not met. Usually this will prove satisfactory 
if the sample size is large (> 30) or if the distribution of X is not highly skewed. This 
property is termed robustness. 


2.11.2 Approximate Confidence Intervals 


In all the cases considered so far we have been able to find exact confidence intervals
for the parameters at hand. In many cases exact confidence intervals are not easily
found either because the underlying random variable is discrete or because no convenient
pivotal quantity can be found. However, if the sample size is large we may often find
random variables whose distribution or density is approximately independent of the given
parameters and so may be used as approximate pivotal quantities. These may then be
used to give approximate confidence intervals for a given parameter.

As a typical example suppose that $\theta = E(X)$ is the expected value of $X$. If $\mathrm{Var}(X) = \sigma^2 <
\infty$ is known and $X_i$, $i = 1, 2, \ldots, n$, is a random sample of size $n$ of $X$, then it follows
from the Central Limit Theorem (CLT) that for large $n$, $Z_n = \sqrt{n}(\bar{X}_n - \theta)/\sigma$ is
approximately $N(0,1)$ and so may be used as an approximate pivotal quantity for $\theta$.
Proceeding as in the preceding section we find that

$$\left( \bar{X}_n - z_{\alpha/2}\,\sigma/\sqrt{n},\; \bar{X}_n + z_{\alpha/2}\,\sigma/\sqrt{n} \right) \tag{2.197}$$

is an approximate $(1 - \alpha) \times 100\%$ confidence interval for $\theta$.

If $\mathrm{Var}(X) = \sigma^2$ is unknown, then as we mentioned in Section 2.8, Eq. (2.117) may
be used to find an approximate confidence interval for $\mu$. Since $T$ random variables are
approximately normal for large $n$ ($> 25$), we get the further approximation

$$\left( \bar{X}_n - z_{\alpha/2}\,s_n/\sqrt{n},\; \bar{X}_n + z_{\alpha/2}\,s_n/\sqrt{n} \right) \tag{2.198}$$

as an approximate $(1 - \alpha) \times 100\%$ confidence interval for $\mu$.


Example 2.27 Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from the Bernoulli
distribution with parameter $\theta$. For large $n$ find an approximate confidence interval for
$\theta = P\{X = 1\}$.

Define $\bar{X}_n = \sum_{i=1}^n X_i / n$. From the Central Limit Theorem, $\sqrt{n}(\bar{X}_n - \theta)/\sqrt{\theta(1 - \theta)}$
is approximately $N(0,1)$. Thus,

$$\left( \bar{X}_n - z_{\alpha/2}\sqrt{\theta(1 - \theta)/n},\; \bar{X}_n + z_{\alpha/2}\sqrt{\theta(1 - \theta)/n} \right) \tag{2.199}$$

is an approximate $(1 - \alpha) \times 100\%$ confidence interval for $\theta$.


Since the end points in (2.199) depend on $\theta$, which is unknown, (2.199) cannot be used
as it stands. Although (2.199) can be manipulated to produce a legitimate confidence
interval, it is simpler to replace $\theta$ with its unbiased estimate $\bar{X}_n$ (frequently denoted by
$\hat{p}$ and called the sample proportion). This gives the further useful approximate confidence
interval

$$\left( \bar{X}_n - z_{\alpha/2}\sqrt{\bar{X}_n(1 - \bar{X}_n)/n},\; \bar{X}_n + z_{\alpha/2}\sqrt{\bar{X}_n(1 - \bar{X}_n)/n} \right) \tag{2.200}$$

for $\theta$.
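
The interval (2.200) is simple to compute. The sketch below is illustrative: the function
name and the counts are assumptions, and $z_{\alpha/2}$ comes from `statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def proportion_interval(successes, n, alpha=0.05):
    """Approximate (1 - alpha) confidence interval (2.200) for a Bernoulli parameter."""
    p_hat = successes / n                      # sample proportion, i.e. the sample mean
    z = NormalDist().inv_cdf(1 - alpha / 2)    # z_{alpha/2}
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical experiment: 40 successes in 100 trials.
print(proportion_interval(40, 100))
```

Because the interval rests on the Central Limit Theorem, it should only be trusted for
large $n$; when the sample proportion is 0 or 1 it collapses to a single point.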


2.12 Hypothesis Testing 


In the estimation problem experiments are conducted for the purpose of trying to determine
the value of some unknown parameter $\theta$. If we have some preconceived idea of
what the parameter should be, we may want to conduct an experiment to either confirm
our belief about $\theta$ or reject the hypothesized value of $\theta$. For example, suppose we are
given a coin which we feel is fair, but to be on the safe side we decide to conduct an
experiment to see if this is true. Here we hypothesize that $P(\text{Heads}) = 1/2$ and after
performing the experiment we will make one of two decisions. We can accept the hypothesis
that $P(\text{Heads}) = 1/2$ or reject this and accept the alternative possibility that
$P(\text{Heads}) \ne 1/2$.

A general version of this type of problem may be given in the following terms. Suppose
that $X$ is either a discrete or a continuous random variable with density $f_X(x;\theta)$.
Suppose we know $\theta$ can take values in one of two possible sets $\Theta_0$ and $\Theta_1$. An experiment
is performed to measure $X$ and on the basis of the results we will make one of two
decisions: either $\theta \in \Theta_0$ or $\theta \in \Theta_1$.

The statement that $\theta \in \Theta_0$ will be called the null hypothesis which is customarily
denoted by $H_0$. The statement that $\theta \in \Theta_1$ is called the alternative hypothesis and is
denoted by $H_1$. In customary terms we say we are going to test $H_0: \theta \in \Theta_0$ against
$H_1: \theta \in \Theta_1$.

In order to test the hypothesis suppose we take a random sample of size $n$ of $X$. If
$R^n(X)$ denotes the sample space, we will decide in favor of $H_0$ if $\mathbf{x} = (x_1, x_2, \ldots, x_n) \in
S_0 \subset R^n(X)$ and decide in favor of $H_1$ if $\mathbf{x} \in S_1 = S_0^c$, the complement of $S_0$. The
partition $(S_0, S_1)$ of the sample space $R^n(X)$ is called a test for the hypothesis $H_0$ against
$H_1$. The theory of hypothesis testing is essentially concerned with devising "good" tests.


Example 2.28 A professor has observed that over the past 10 years his probability
class has achieved a mean grade of 76 on the final exam and the Dean has suggested that
he make greater use of audio-visual aids in order to have his class achieve higher grades.
The professor feels that if he uses such aids the mean grade will not be decreased so he
decides to do this next semester to see what happens. Describe the decision that the
professor faces as a hypothesis testing problem.

Let us assume that the final grade of the typical student can be described as a $N(\theta, \sigma^2)$
random variable. Then the professor wants to decide whether $\theta = 76$ or $\theta > 76$. Here we
have $\Theta_0 = \{\theta = 76\}$ and $\Theta_1 = \{\theta > 76\}$. Thus $H_0: \theta \in \{76\}$ and $H_1: \theta \in \{\theta > 76\}$.

As a test we choose $S_1 = \{\bar{X}_n > k\}$ and $S_0 = \{\bar{X}_n \le k\}$ for some $k$. (Justify this.)


2.12.1 Best Tests 


To devise good tests we must consider the nature of the errors that can be made in
deciding between $H_0$ and $H_1$. We begin with the special case where $\Theta_0$ and $\Theta_1$ each
consist of one point. This is customarily called the problem of testing a simple hypothesis
against a simple alternative. (If either $\Theta_0$ or $\Theta_1$ consists of more than one point then we
say that the corresponding hypothesis is composite.)

Since either $H_0$ or $H_1$ can be true there are two types of errors that we can make. We
can accept $H_0$ when $H_1$ is true or we can accept $H_1$ when $H_0$ is true. Since we accept
$H_0$ when $\mathbf{x} \in S_0$ we decide in favor of $H_1$ when $\mathbf{x} \in S_1 = S_0^c$. Thus the probability
of deciding in favor of $H_1$ when $H_0$ is true is $P_0(S_1) = P(S_1 \mid H_0)$ where $P_0(S_1)$ is the
probability of $S_1$ using the measure induced on $R^n(X)$ by $f_X(x;\theta_0)$. Similarly, we decide in
favor of $H_0$ when $\mathbf{x} \in S_0$, so the probability of deciding in favor of $H_0$ when $H_1$ is true
is $P(S_0 \mid H_1)$. $S_1$ is called the critical region of the test (i.e., the region where $H_0$ is
rejected) and $P(S_1 \mid H_0) = \alpha$ is called the size of Type I error. $P(S_0 \mid H_1) = \beta$ is called
the size of Type II error. Since $\alpha$ and $\beta$ are the probabilities of making errors, a good
test should try to make these as small as possible. Ideally we would like to minimize $\alpha$
and $\beta$ simultaneously. Unfortunately this cannot be done. In the extreme case, if we
take $S_0 = R^n(X)$ then $\alpha = P\{S_1 = \emptyset \mid H_0\} = 0$ while $\beta = P(R^n(X) \mid H_1) = 1$. Thus
minimizing $\alpha$ maximizes $\beta$. Hopefully, we can find some happy medium.

In the most widely used version of the hypothesis testing problem $\alpha$ is fixed and
we try to find a test which minimizes $\beta$. A test which minimizes $\beta$ from among all
those which have the size of Type I error at most $\alpha$ is called a best test. Theorem 2.2,
the famous Neyman-Pearson lemma, shows how to construct a best test for a simple
hypothesis against a simple alternative.
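
The trade-off between $\alpha$ and $\beta$ can be made concrete for a test of the form
$S_1 = \{\bar{X}_n > k\}$ for a normal mean. The sketch below is illustrative only: the alternative
value 80, the standard deviation 10, and the sample size 25 are assumptions chosen for the
example, not values from the text.

```python
from math import sqrt
from statistics import NormalDist

def error_sizes(k, theta0, theta1, sigma, n):
    """Sizes of the Type I and Type II errors for the test 'reject H0 if xbar > k'."""
    se = sigma / sqrt(n)                          # standard deviation of the sample mean
    alpha = 1 - NormalDist(theta0, se).cdf(k)     # P(xbar > k | H0)
    beta = NormalDist(theta1, se).cdf(k)          # P(xbar <= k | H1)
    return alpha, beta

# Raising the cutoff k lowers alpha but raises beta.
for k in (77.0, 78.0, 79.0, 80.0):
    print(k, error_sizes(k, theta0=76.0, theta1=80.0, sigma=10.0, n=25))
```

Moving the cutoff only redistributes the two error probabilities. To reduce both at
once one must change the experiment itself, for example by increasing $n$.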


Definition 2.5 If a test has type I error of size $\alpha$, then the critical region $S_1$ is said
to be of size $\alpha$. $\alpha$ is also called the significance level of the test.


Theorem 2.2 (Neyman-Pearson Lemma) If there exists a critical region $S_1$ of size $\alpha$
and a non-negative constant $k$ such that

$$L = \frac{\prod_{i=1}^n f_X(x_i;\theta_1)}{\prod_{i=1}^n f_X(x_i;\theta_0)} \ge k, \quad \mathbf{x} \in S_1 \tag{2.201}$$

and

$$L \le k, \quad \mathbf{x} \in S_0, \tag{2.202}$$

then $(S_0, S_1)$ is a best test of $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$. (An intuitive interpretation
of (2.201) and (2.202) is that we reject $H_0$ if the "probability" of $\mathbf{x} = (x_1, x_2, \ldots, x_n)$
occurring under $H_1$ is much larger than the probability of $\mathbf{x}$ occurring under $H_0$.)


Proof. A proof of Theorem 2.2 can be found in [63, 89]. □


The ratio $L$ in (2.201) is usually called the likelihood ratio and the test given by (2.201)
and (2.202) is a likelihood ratio test (LRT). Before giving several specific applications of
Theorem 2.2 let us outline the mechanics of its use.

In general the inequality $L > k$ is quite complicated so that we begin to determine the
critical region by manipulating $L > k$ so that $S_1$ is given by a more tractable inequality
$L' > k'$. Since $L$, or equivalently $L'$, is a function of $\mathbf{x}$, it is a random variable, and
then $k'$ is chosen so that $P\{L' > k' \mid H_0\} = \alpha$. In order to do this we need to be able to
calculate the distribution of $L'$ under $H_0$. In special cases this can be done exactly. More
often, one has to rely on approximations, such as the Central Limit Theorem.


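Example 2.29 Let $X_1, X_2, \ldots, X_n$ be a random sample of size $n$ from a $N(\theta, 1)$.
Find the best test based on a random sample of size $n$ of $H_0: \theta = \theta_0$ against $H_1: \theta = \theta_1$
where $\theta_1 > \theta_0$.

Here, $f_X(\mathbf{x};\theta_i) = \exp\left[ -\sum_{j=1}^n (x_j - \theta_i)^2 / 2 \right] / (2\pi)^{n/2}$, $i = 0, 1$. Thus, the likelihood
ratio is

$$\begin{aligned}
L &= \exp\left[ -\frac{1}{2} \sum_{j=1}^n (x_j - \theta_1)^2 + \frac{1}{2} \sum_{j=1}^n (x_j - \theta_0)^2 \right] \\
  &= \exp\left[ (\theta_1 - \theta_0) \sum_{j=1}^n x_j - \frac{n}{2}\left( \theta_1^2 - \theta_0^2 \right) \right] \\
  &= \exp\left[ (\theta_1 - \theta_0) \sum_{j=1}^n x_j \right] \exp\left[ -\frac{n}{2}\left( \theta_1^2 - \theta_0^2 \right) \right].
\end{aligned} \tag{2.203}$$

In this case the critical region is

$$\exp\left[ (\theta_1 - \theta_0) \sum_{j=1}^n x_j \right] \exp\left[ -\frac{n}{2}\left( \theta_1^2 - \theta_0^2 \right) \right] \ge k. \tag{2.204}$$

Solving (2.204) using $(\theta_1 - \theta_0) > 0$ gives

$$\sum_{j=1}^n x_j \ge a, \qquad a = \frac{2 \ln k + n\left( \theta_1^2 - \theta_0^2 \right)}{2(\theta_1 - \theta_0)}. \tag{2.205}$$

Thus, our test is to reject $H_0$ when $\sum_{j=1}^n x_j \ge a$, where $a$ is chosen so that

$$P\left\{ \sum_{j=1}^n X_j \ge a \,\Big|\, H_0 \right\} = \alpha. \tag{2.206}$$

Under $H_0$, $X_j$ is $N(\theta_0, 1)$ so that $\left( \sum_{j=1}^n X_j - n\theta_0 \right) / \sqrt{n}$ is $N(0,1)$. This gives

$$\alpha = P\left\{ \frac{\sum_{j=1}^n X_j - n\theta_0}{\sqrt{n}} \ge \frac{a - n\theta_0}{\sqrt{n}} \right\} = P\left\{ Z \ge (a - n\theta_0)/\sqrt{n} \right\}. \tag{2.207}$$

By the Neyman-Pearson lemma, the best critical region is given by

$$(a - n\theta_0)/\sqrt{n} = z_\alpha \tag{2.208}$$

where $P\{Z > z_\alpha\} = \alpha$. Thus we will reject $H_0$ if

$$\sum_{j=1}^n x_j \ge \sqrt{n}\, z_\alpha + n\theta_0 \tag{2.209}$$

or equivalently if $\left( \sum_{j=1}^n x_j - n\theta_0 \right) / \sqrt{n} \ge z_\alpha$, where $(x_1, x_2, \ldots, x_n)$ are the observed
outcomes.

As a numerical example, suppose we wish to test $\theta = 1$ against $\theta = 2$ for a sample of
size 25 and $\alpha = 0.05$. Suppose $\sum_{j=1}^{25} x_j = 30$; what decision is made? By Eq. (2.209) we
will reject $H_0$ if $\sum_{j=1}^{25} x_j \ge \sqrt{25}\, z_{0.05} + 25 = 33.2$. Since $\sum_{j=1}^{25} x_j = 30 < 33.2$ we accept
the hypothesis that $\theta = 1$ at the level of significance 0.05.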
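
The arithmetic of this numerical example can be reproduced with a short script. This is
a sketch only; the function name is hypothetical and $z_\alpha$ is taken from
`statistics.NormalDist`.

```python
from math import sqrt
from statistics import NormalDist

def np_critical_value(theta0, n, alpha):
    """Right-hand side of (2.209): reject H0 when the sum of the x_j reaches this value."""
    z_alpha = NormalDist().inv_cdf(1 - alpha)  # P{Z > z_alpha} = alpha
    return sqrt(n) * z_alpha + n * theta0

critical = np_critical_value(theta0=1, n=25, alpha=0.05)
observed_sum = 30
print(round(critical, 1))        # critical value for the sum
print(observed_sum >= critical)  # True would mean: reject H0
```

The critical value does not involve $\theta_1$, so the same computation applies to any
alternative value larger than $\theta_0$.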


Before examining more general hypothesis testing problems we make some further
comments about the Neyman-Pearson lemma. First, if the unknown parameter is a
vector $\boldsymbol\theta = (\theta_1, \theta_2, \ldots, \theta_m)$, then Theorem 2.2 provides the best test for $H_0: \boldsymbol\theta = \boldsymbol\theta_0$
against $H_1: \boldsymbol\theta = \boldsymbol\theta_1$ since the scalar nature of $\theta$ plays no role in the proof. Second, the
Neyman-Pearson lemma sometimes provides a best test for a simple hypothesis against
a composite alternative in the sense that the critical region minimizes $\beta$ simultaneously
for all values $\theta_1 \in \Theta_1$.

For example in Example 2.29 we should note that the critical region does not depend
on $\theta_1$ as long as $\theta_1 > \theta_0$. Thus the critical region $\sum_{j=1}^n x_j \ge \sqrt{n}\, z_\alpha + n\theta_0$ minimizes $\beta$
for all $\theta_1 > \theta_0$. Such tests are usually called uniformly most powerful (UMP) tests.


2.12.2 Generalized Likelihood Ratio Tests 


Suppose now that we wish to devise good tests to test the general hypothesis $H_0: \theta \in \Theta_0$
against $H_1: \theta \in \Theta_1$. If either $H_0$ or $H_1$ is composite there is no generalization of the
Neyman-Pearson lemma available. However, it seems reasonable that good tests may be
devised by considering the likelihood ratio

$$L = f_X(\mathbf{x};\theta_1) / f_X(\mathbf{x};\theta_0), \quad \theta_0 \in \Theta_0,\ \theta_1 \in \Theta_1. \tag{2.210}$$

We will restrict our attention to the case of testing a simple hypothesis against the
possibly composite alternative $H_1: \theta \in \Theta_1$.

In this case $\theta \in \Theta_1$ is unknown and so must be estimated in some fashion. If we use
the MLE $\hat\theta_1$ of $\theta$ then the analogue of the Neyman-Pearson lemma is to reject $H_0$ if

$$L = f_X(\mathbf{x};\hat\theta_1) / f_X(\mathbf{x};\theta_0) > k, \tag{2.211}$$

where $k$ is chosen so that

$$P(L > k \mid H_0) = \alpha. \tag{2.212}$$

The test determined by (2.211)-(2.212) is called the generalized likelihood ratio test
(GLRT) of $H_0$ against $H_1$.

As a particular case consider the situation of testing $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$,
$\theta \in R$. Here $\Theta_1 = R - \{\theta_0\}$, and in cases where $X$ has a density it is permissible to find the
MLE of $\theta$ over $R$ ($R$ = real numbers), since generally $P\{\hat\theta_1 = \theta_0\} = 0$.


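Example 2.30 Let $X$ be a $N(\theta, 1)$ random variable. Determine the generalized
likelihood ratio test for $H_0: \theta = \theta_0$ against $H_1: \theta \ne \theta_0$.
From Example 2.19 the maximum likelihood estimator of $\theta \in R$ is $\hat\theta = \sum_{i=1}^n x_i / n = \bar{x}_n$.
Thus

$$\begin{aligned}
f_X(\mathbf{x};\theta_0) &= (2\pi)^{-n/2} \exp\left[ -\frac{1}{2} \sum_{i=1}^n (x_i - \theta_0)^2 \right] \\
 &= (2\pi)^{-n/2} \exp\left[ -\frac{1}{2} \left\{ \sum_{i=1}^n (x_i - \bar{x}_n)^2 + n(\bar{x}_n - \theta_0)^2 \right\} \right].
\end{aligned} \tag{2.213}$$

Using this in (2.211) gives

$$L = \exp\left[ n(\bar{x}_n - \theta_0)^2 / 2 \right]. \tag{2.214}$$

Thus, the generalized likelihood ratio test is equivalent to rejecting $H_0$ if

$$(\bar{x}_n - \theta_0)^2 \ge a, \tag{2.215}$$

and this is equivalent to rejecting $H_0$ if either $\bar{x}_n - \theta_0 \ge \sqrt{a} = b$ or $\bar{x}_n - \theta_0 \le -\sqrt{a} = -b$.
(This test is usually called a two-tailed test.) $b$ may be determined as in Example 2.29.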
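
A minimal sketch of the two-tailed rule follows. It relies on one step that is implied by
the reference to Example 2.29: under $H_0$ the quantity $\sqrt{n}(\bar{X}_n - \theta_0)$ is $N(0,1)$ when the
variance is 1, so a test of size $\alpha$ has $b = z_{\alpha/2}/\sqrt{n}$. The function name and the numbers
are illustrative.

```python
from math import sqrt
from statistics import NormalDist

def two_tailed_reject(xbar, theta0, n, alpha=0.05):
    """Two-tailed test for the mean of N(theta, 1) data: True means reject H0."""
    b = NormalDist().inv_cdf(1 - alpha / 2) / sqrt(n)  # b = z_{alpha/2} / sqrt(n)
    return abs(xbar - theta0) >= b

# Hypothetical sample means from n = 25 observations, testing theta0 = 1.
for xbar in (0.5, 1.2, 1.5):
    print(xbar, two_tailed_reject(xbar, theta0=1.0, n=25))
```

Sample means far from $\theta_0$ in either direction lead to rejection, which is what
distinguishes this test from the one-sided test of Example 2.29.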

When $H_0$ is true we recognize the left hand side of (2.207) as the value of $T^2$, where
$T$ is the $T$ statistic discussed in Sections 2.8 and 2.11. The likelihood ratio test in this
case becomes: reject $H_0$ if either

$$T > b \quad \text{or} \quad T < -b \tag{2.216}$$

where $b$ is chosen so that

$$P\{|T| < b\} = 1 - \alpha. \tag{2.217}$$

Since $T$ has a $t$ density with $n - 1$ degrees of freedom, $b = t_{n-1,\alpha/2}$ and the critical region
is

$$T < -t_{n-1,\alpha/2} \quad \text{or} \quad T > t_{n-1,\alpha/2}. \tag{2.218}$$
2.13 Hypothesis Testing and Confidence Intervals 


In Section 2.12 we observed that the $t$ test for normal means was based on the same
statistic used in developing a confidence interval for the same parameter in Section 2.11.
This fact (and similar ones) leads one to examine the relationship between the notions
of confidence intervals and hypothesis testing.

Suppose then that $(\hat\theta_L, \hat\theta_U)$ is an exact $(1 - \alpha) \times 100\%$ confidence interval for a
parameter $\theta$; then $\hat\theta_L < \theta < \hat\theta_U$ with probability $1 - \alpha$. Since $\hat\theta_L$ and $\hat\theta_U$ place bounds
on the possible values of $\theta$ we can use this information to develop a test of $H_0: \theta = \theta_0$
against $H_1: \theta \ne \theta_0$. If $\theta_0 \in (\hat\theta_L, \hat\theta_U)$ we accept the hypothesis, otherwise we reject it.
This gives a test whose critical region consists of the set $C = \{\theta_0 > \hat\theta_U\} \cup \{\theta_0 < \hat\theta_L\}$.
Since

$$P(C \mid H_0) = 1 - P\{\hat\theta_L < \theta_0 < \hat\theta_U\} = \alpha \tag{2.219}$$

we have a test of size $\alpha$. (Sometimes such tests are equivalent to a generalized likelihood
ratio test, such as the case of the $t$-test, other times they are not.)

One-sided confidence intervals may be used to develop tests of one-sided hypotheses.
For example, suppose we have a lower confidence limit $\hat\theta_L$ where $P\{\hat\theta_L < \theta\} = 1 - \alpha$.
Then, if we want to test the hypothesis $H_0: \theta = \theta_0$ against $H_1: \theta > \theta_0$ we can do it
in the following way: accept $H_0$ if $\hat\theta_L < \theta_0$ and reject $H_0$ otherwise. In this case, we
have a test of size $\alpha$ whose critical region is $\hat\theta_L \ge \theta_0$. As a particular case, suppose that
$X$ is $N(\theta, \sigma^2)$ where $\sigma$ is known. Then $\hat\theta_L = \bar{X}_n - z_\alpha \sigma/\sqrt{n}$ is a $(1 - \alpha) \times 100\%$ lower
confidence limit for $\theta$. Then, if we apply the critical region determined by $\hat\theta_L$, the critical
region for testing $H_0: \theta = \theta_0$ against $H_1: \theta > \theta_0$ is: reject $H_0$ if $\bar{X}_n - z_\alpha \sigma/\sqrt{n} \ge \theta_0$
or $\bar{X}_n \ge \theta_0 + z_\alpha \sigma/\sqrt{n}$. This is the likelihood ratio test obtained in Example 2.29 (there
$\sigma = 1$).

A similar procedure may be used for testing $H_0: \theta = \theta_0$ against $H_1: \theta < \theta_0$
by using an upper confidence limit $\hat\theta_U$ for $\theta$. The details are left to the reader. This
connection between confidence intervals and hypothesis testing will play a major role in
our development of tests associated with various hypotheses in regression analysis.
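
The equivalence between the interval-based test and a test based directly on the
standardized statistic can be checked numerically for the normal mean with known $\sigma$.
The following sketch is illustrative; all function names and numerical values are
assumptions made for the example.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, alpha):
    """Two-sided (1 - alpha) confidence interval for a normal mean, sigma known."""
    half_width = NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)
    return xbar - half_width, xbar + half_width

def reject_by_interval(xbar, theta0, sigma, n, alpha=0.05):
    """Reject H0: theta = theta0 when theta0 falls outside the confidence interval."""
    lower, upper = z_interval(xbar, sigma, n, alpha)
    return not (lower < theta0 < upper)

def reject_by_statistic(xbar, theta0, sigma, n, alpha=0.05):
    """Reject H0 when the standardized statistic is at least z_{alpha/2} in absolute value."""
    z = sqrt(n) * (xbar - theta0) / sigma
    return abs(z) >= NormalDist().inv_cdf(1 - alpha / 2)

# The two decision rules agree for each hypothetical sample mean.
for xbar in (0.2, 0.9, 1.0, 1.3, 1.5):
    print(xbar, reject_by_interval(xbar, 1.0, 1.0, 25), reject_by_statistic(xbar, 1.0, 1.0, 25))
```

Both rules give the same decision for every sample mean, which is the duality described
in this section.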


2.14 Exercises 


2.1 Consider an experiment that consists of tossing a fair coin and a 6-sided die. 
(a) Describe the sample space with the elements for the experiment. 
(b) What is the probability of the event that either a head of the coin occurs or
the number of dots is greater than four?
2.2 For given two events $A_1$ and $A_2$, find the union and the intersection where
(a) $A_1 = \{0, 1, 2, 3\}$, $A_2 = \{2, 3, 4, 5\}$.
(b) $A_1 = \{x : 0 < x < 2\}$, $A_2 = \{x : 1 < x < 3\}$.
2.3 Let $A$ and $B$ be two events such that $P(A) = 0.48$, $P(A \cup B) = 0.72$.
(a) Find $P(B)$ if $P(A \cap B) = 0.24$.
(b) Find P (B) if P (B) = 0.64. 
(c) Find P (B) if A, B are mutually exclusive. 
(d) Find P(B) if A, B are independent. 
(e) Find P (B) if P(A|B) = 0.36. 
2.4 Let $f(x) = x/10$, $x = 1, 2, 3, 4$, zero elsewhere, be the pdf of a random variable $X$.


(a) Find the distribution function F (x) of X and sketch its graph along with that 
of f (x). 
(b) Find P(X = 1 or 2) and P (1 < X < 2). 


2.5 Let the random variables $X$ and $Y$ have the joint pdf

$$f(x, y) = \begin{cases} 2, & 0 < x < y < 1,\ 0 < y < 1 \\ 0, & \text{elsewhere.} \end{cases}$$

(a) Find the marginal probability distributions of $X$ and $Y$, respectively.
(b) Show that the correlation coefficient of $X$ and $Y$ is $\rho_{X,Y} = 1/2$.
(c) Find the conditional distribution $f(y|x)$ and the conditional mean $E(Y|x)$.
(d) If the conditional mean of $Y$, given $X = x$, is linear in $x$, then that conditional
mean is given by

$$E(Y|x) = \mu_Y + \rho \frac{\sigma_Y}{\sigma_X} (x - \mu_X).$$

Verify this in (c).
(e) Verify that the conditional variance of $Y$, given $X = x$, is $\mathrm{Var}(Y|x) = \sigma_Y^2 (1 - \rho^2)$.


2.6 Suppose that the bivariate density for (X, Y) is given by

$$f(x,y) = \begin{cases} 8xy, & 0 < x < y < 1, \\ 0, & \text{elsewhere.} \end{cases}$$

(a) Find $\mathrm{Var}(X + Y)$.
(b) Find the coefficient of correlation, $\rho_{X,Y}$.
2.7 Show that X + Y and X − Y are uncorrelated if and only if $\mathrm{Var}(X) = \mathrm{Var}(Y)$.
2.8 Show that $\mathrm{Cov}(X, X + Y) = \mathrm{Var}(X) + \mathrm{Cov}(X, Y)$. More generally, show that

$$\mathrm{Cov}\Big(\sum_i a_i X_i,\ \sum_j b_j Y_j\Big) = \sum_i \sum_j a_i b_j\, \mathrm{Cov}(X_i, Y_j).$$


2.9 Show that each of the following families of distributions is an exponential family. 
(a) normal distribution with either parameter $\mu$ or $\sigma$ known; $X \sim N(\mu, \sigma^2)$.
(b) gamma distribution with either parameter $\alpha$ or $\beta$ known; $X \sim G(\alpha, \beta)$.

$[f_X(x) = \beta^{\alpha} x^{\alpha-1} \exp(-\beta x)/\Gamma(\alpha)]$

(c) Poisson distribution with parameter $\lambda$; $X \sim \mathrm{Poisson}(\lambda)$.

2.10 Given that X has a Poisson distribution with variance 1, find $P(X = 2)$ and
$P(X \le 2)$.

2.11 Let the random variable X have a normal distribution with mean $\mu$ and
variance $\sigma^2$. Consider the transformation $Z = (X - \mu)/\sigma$.
(a) Show that $E(Z) = 0$ and $E(Z^2) = 1$.
(b) Let $M(t)$, $-h < t < h$, denote the moment generating function of the random
variable X. Show that $E[\exp(tZ)] = \exp(-\mu t/\sigma)\, M(t/\sigma)$, $-h\sigma < t < h\sigma$.

2.12 Let X be a random variable whose second moment exists.
(a) Show that, over all real numbers b, $E[(X - b)^2]$ is a minimum when $b = E(X)$.
(b) Show that $E(X^2) \ge [E(X)]^2$.

2.13 Let X have a binomial distribution with the parameters n and p.
(a) Find the moment generating function of X, Mx (t). 
(b) Using the result in (a) find the mean and the variance of X. 
(c) Compute E(X) and Var(X) by direct summation. 


2.14 If X is a Poisson random variable, calculate $P(X = x \mid X > y)$. (These distributions,
for fixed y and variable x, are sometimes called conditional Poisson distributions.)


2.15 Suppose X follows a Poisson distribution with parameter $\lambda$. Then, is $P(X$ takes
an even value$) = P(X$ takes an odd value$)$?


2.16 Let $\Phi(z)$ denote the cumulative distribution function of a standard normal random
variable Z, i.e., $\Phi(z) = P(Z \le z)$. Show that $\Phi(-z) = 1 - \Phi(z)$.

2.17 If X is $N(\mu, \sigma^2)$, find c such that
(a) $P(-c < (X - \mu)/\sigma < c) = 0.90$.
(b) $P(|(X - \mu)/\sigma| > c) = 0.05$.

2.18 Suppose that X is normally distributed with mean $\mu = 10$ and variance $\sigma^2 = 4$.
(a) Compute $P(|X - 9| > 1)$.
(b) Find x such that $P(X > x) = 0.9750$.
(c) Find the mean and the variance of $Y = 3X - 2$.

2.19 Suppose that $X/\sigma^2$ has a chi-square distribution with 10 degrees of freedom.
Determine the pdf, mean, and variance of X.

2.20 Let X be $\chi^2(30)$.
(a) Find $E(X)$ and $\mathrm{Var}(X)$. Use these to approximate $P(18.5 < X < 43.8)$.
(b) Find the exact value of $P(18.5 < X < 43.8)$.


2.21 Let $s^2$ be the variance of a random sample of size n = 6 from the normal distribution
with unknown $\mu$ and variance $\sigma^2 = 12$. Find $P(2.30 < s^2 < 22.2)$.


2.22 Find the mean and variance of $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})^2$, where $X_1, X_2, \ldots, X_n$
denote a random sample of size n from a $N(\mu, \sigma^2)$. [Hint: Find the mean and
variance of $(n-1)s^2/\sigma^2$.]


2.23 Let U have a uniform distribution on [0,1]. Show that the variable $-2\log U$ has
a $\chi^2$-distribution with 2 degrees of freedom. [Hint: Consider $P(-2\log U < z)$.]


2.24 Suppose that T has a t-distribution with 8 degrees of freedom. Find $P(|T| > 2.306)$.


2.25 Let T have a t-distribution with 12 degrees of freedom. Find k such that
(a) $P(-k < T < k) = 0.90$.
(b) $P(|T| > k) = 0.99$.


2.26 Let F have an F-distribution with degrees of freedom $\nu_1$ and $\nu_2$. Show that 1/F
has an F-distribution with degrees of freedom $\nu_2$ and $\nu_1$.
2.27 Let F have an F-distribution with degrees of freedom $\nu_1 = 2$ and $\nu_2 = 6$. Find
a and b such that $P(F < a) = 0.05$ and $P(F < b) = 0.95$, and accordingly,
$P(a < F < b) = 0.90$.


[Hint: Write P (F <a) = P(1/F > 1/a) = 1— P (1/F < 1/a), and use the result 
of Exercise 2.26.] 

2.28 Let a random variable X have a lognormal distribution and let Y have a normal 
distribution. 
(a) Using the cdf technique, find the pdf of X. [See hint in Ex. 2.23.] 
(b) Express the k-th central moment of X using mgf of log X. 
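(c) Use the result in (b) to find the mean and the variance of X.


2.29 Let $\theta$ denote a parameter and define the bias $B(\hat\theta) = |E(\hat\theta) - \theta|$. For an
estimator $\hat\theta$, show that the mean square error of the point estimator $\hat\theta$ satisfies

$$\mathrm{MSE}(\hat\theta) = E\big[(\hat\theta - \theta)^2\big] = \mathrm{Var}(\hat\theta) + [B(\hat\theta)]^2.$$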


2.30 Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from a Poisson distribution with
mean $\lambda$. Find the method of moments estimator of $\lambda$.


2.31 Let $X_1, X_2, \ldots, X_n$ denote a random sample of size n from the normal distribution
with mean $\mu = 0$ and unknown $\sigma^2$. Find the method of moments estimator of $\sigma^2$.


2.32 Suppose that $X_1, X_2, \ldots, X_n$ constitute a random sample from the gamma distribution
with parameters $(\alpha, \beta)$. Find the method of moments estimators of $(\alpha, \beta)$.
[Hint: Find $E(X)$, $\mathrm{Var}(X)$.]

2.33 Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with common pdf (or pmf). Find
an MLE for $\theta$ in each of the following cases.
(a) $f(x) = (1/\theta)\exp(-x/\theta)$, $0 < x < \infty$, $\theta > 0$.
(b) $f(x) = \exp(-x + \theta)$, $\theta < x < \infty$.
(c) $p(x) = 1/\theta$, $x = 1,2,\ldots,\theta$, $1 \le \theta \le \theta_0$, where $\theta_0$ is a known integer.

2.34 Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from a Poisson distribution with
mean $\lambda$.
(a) Find the maximum likelihood estimator $\hat\lambda$ for $\lambda$.
(b) Find $E(\hat\lambda)$ and $\mathrm{Var}(\hat\lambda)$.
(c) Show that the estimator in (a) is consistent for $\lambda$.
(d) Find the MLE for $P(X = 0) = \exp(-\lambda)$.
2.35 Suppose that we have a random sample of size 5, $X_1, X_2, \ldots, X_5$, from an exponential
distribution with pdf given by $f(x) = (1/\theta)\exp(-x/\theta)$, $x > 0$. Suppose that
we consider the following estimators:
Xı Ха 


0; — Xi, б = ——— +2, 
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where X = i pae (7 
(a) Determine which estimators of $\theta$ are unbiased.
(b) Calculate the variance of each of the estimators. 
(c) Calculate the efficiency of 6, relative to 63, and of 63 relative to 64. 
2.36 Let $X_1, X_2, \ldots, X_n$ be a random sample of size n from a certain uniform distribution.
Find the MLE of $\theta$ if
(a) the distribution is $U(0, \theta)$.
(b) the distribution is $U(\theta - 2, \theta + 5)$.


2.37 Let $X_1, X_2, \ldots, X_n$ denote a random sample of size n from the Poisson distribution
with parameter $\lambda$. Show that $\sum_{i=1}^{n} X_i$ is sufficient for $\lambda$.


2.38 Use the factorization criterion to determine, in each case, a sufficient statistic
based on a random sample of size n.
(a) $p(x) = p^x(1-p)^{1-x}$, $x = 0,1$.
(b) $p(x) = p(1-p)^x$, $x = 0,1,2,\ldots$
(c) $f(x) = \lambda\exp(-\lambda x)$, $x > 0$.


2.39 Let the observed value of the mean $\bar{Y}$ of a random sample of size 15 from a
$N(\theta, 52)$ be 13.2. Construct a 90% confidence interval for $\theta$.


2.40 Let $X_1, X_2, \ldots, X_{10}$ denote a random sample of size n = 10 from a $N(\mu, \sigma^2)$
random variable.
(a) If $\sigma$ is known, find the length of a 95% confidence interval for $\mu$ if this interval
is based on the random variable $\sqrt{10}\,(\bar{X} - \mu)/\sigma$.
(b) If $\sigma$ is unknown, find the expected value of the length of a 95% confidence
interval for $\mu$ if this interval is based on the random variable $\sqrt{9}\,(\bar{X} - \mu)/s$.


2.41 Suppose that $X_1, X_2, \ldots, X_{200}$ is a random sample from a Bernoulli(p) population.
Let $\hat{p} = \sum x_i/n$. If the observed value of $\hat{p}$ is 0.25, find an approximate 90%
confidence interval for the true proportion p.


2.42 Let $X_1, X_2, \ldots, X_{16}$ be a random sample of size n = 16 from a $N(\mu, 1)$ random
variable. We wish to test $H_0 : \mu = 20$ against $H_1 : \mu \ne 20$ at $\alpha = 0.05$, based on
the sample mean $\bar{X}$.
(a) Determine critical regions of the form $C_1 = \{\bar{x} : \bar{x} < a\}$ and $C_2 = \{\bar{x} : \bar{x} > b\}$.
(b) Find $\beta$, the probability of a Type II error, for each critical region in (a) for the
alternative $H_1 : \mu = 21$.


2.43 Let $X_1, X_2, \ldots, X_{25}$ denote a random sample of size n = 25 from a $N(\mu, 100)$
random variable. Derive the uniformly most powerful (UMP) test with size $\alpha =
0.10$ for testing $H_0 : \mu = 75$ against $H_1 : \mu > 75$.
2.44 Let $X_1, X_2, \ldots, X_n$ denote a random sample from the Poisson distribution with
parameter $\theta$.
(a) Find a UMP test with size $\alpha = 0.05$ for testing $H_0 : \theta = \theta_0$ against $H_1 : \theta > \theta_0$.
(b) Sketch the power function, $K(\theta)$, for $\theta_0 = 1$ and n = 25. Take $\alpha = 0.05$. (Hint:
Use the Central Limit Theorem.)


2.45 Find a generalized likelihood ratio test of size $\alpha$ for testing $H_0 : \theta \le 1$ versus
$H_1 : \theta > 1$ based on a random sample $X_1, X_2, \ldots, X_n$ from $f(x) = \theta\exp(-\theta x)$,
$0 < x < \infty$.


Chapter 3 


Simple Linear Regression 


3.1 Introduction 


In this chapter we begin our formal analysis of regression models by considering the
simple linear regression model

$$Y_x = \beta_0 + \beta_1 x + \varepsilon_x. \tag{3.1}$$

Here, $Y_x$ is the dependent or response variable, $x$ is the independent or design variable
and $\varepsilon_x$ is an error random variable used to represent the variation of $Y_x$ not explained
by the linear part $\beta_0 + \beta_1 x$. Conceptually, we may think of the values of $Y_x$ as generated
by choosing a value of $x$, computing $\beta_0 + \beta_1 x$ and adding the random error given by $\varepsilon_x$.
Thus, if we pick n values of x, $(x_1, x_2, \ldots, x_n)$, and plot the resulting data $(x_i, y_i = y_{x_i})$,
then depending on the variability of $\varepsilon_x$, these points should scatter around the line
$y = \beta_0 + \beta_1 x$. Of course, in practice, only the data $(x_i, y_i)$, $1 \le i \le n$, will be known
and the purpose of the statistical analysis is to determine if the model (3.1) is capable
of explaining the observed variability in the y's. In addition, since some of $(\beta_0, \beta_1, \varepsilon_x)$
are usually unknown, our first task is to find estimates of these quantities. Comparison
of the fitted model

$$\hat{y} = \hat\beta_0 + \hat\beta_1 x, \tag{3.2}$$


where $(\hat\beta_0, \hat\beta_1)$ are estimates of $(\beta_0, \beta_1)$, with the known data then provides a way
of determining whether or not (3.1) appears to be appropriate. Although many techniques
are available for doing this, for instance, merely drawing an 'eyeball' line through
$\{(x_i, y_i)\}_{i=1}^{n}$, consideration of more elementary statistical situations suggests that knowing
the distribution of $Y_x$, and hence of $\varepsilon_x$, is crucial. However, a priori, there is no more reason for
knowing this than knowing $(\beta_0, \beta_1)$ and this argument seems to bring us to a logical
impasse. To proceed further there are two avenues of approach:


(1) estimate $(\beta_0, \beta_1)$ using a method which does not depend on the distribution of $Y_x$;


(2) make a plausible assumption about the error structure, estimate $(\beta_0, \beta_1)$ and then
check the assumptions.

As we shall see, in the important case where the errors $\varepsilon_x$ are assumed to have a
normal distribution and maximum likelihood estimation is used to obtain $(\hat\beta_0, \hat\beta_1)$ we
obtain a method of estimation, least squares, which can be applied to studying linear
models regardless of the underlying error distribution. This technique, apparently first
derived by Gauss in the 18th century, is the most widely used method of estimation in
regression analysis and most of this book will be devoted to analyzing its consequences.
However, because of problems which can occur when the errors are not normal, much
work has been done recently on alternative estimation procedures, such as maximum
likelihood and robust regression techniques. Some of these are discussed in [27, 87].


3.2 The Error Model 


As we have previously stated, the simple linear regression model is of the form
$Y_x = \beta_0 + \beta_1 x + \varepsilon_x$, where we now assume that $\varepsilon_x$ does not depend on $x$ and $E(\varepsilon_x) = 0$ so that

$$E(Y_x) = \beta_0 + \beta_1 x. \tag{3.3}$$

If n observations of $Y_x$ are made at the points $\{x_i\}_{i=1}^{n}$, then letting $Y_i = Y_{x_i}$
denote the i-th observation,

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \tag{3.4}$$

where $\varepsilon_i = \varepsilon_{x_i}$, $i = 1,2,\ldots,n$.

To estimate $(\beta_0, \beta_1)$ we assume that $Y_i$, $1 \le i \le n$, are independent normal random
variables with mean $\beta_0 + \beta_1 x_i$ and common variance $\sigma^2$. This is equivalent to assuming
that $\{\varepsilon_i\}_{i=1}^{n}$ are independent normal random variables with mean zero and variance $\sigma^2$. (This is usually
written as $Y_i \sim N(\beta_0 + \beta_1 x_i, \sigma^2)$ and $\varepsilon_i \sim N(0, \sigma^2)$.) In this case we say that the errors
are homoscedastic (otherwise they are heteroscedastic).

Since maximum likelihood estimation is known to be optimal for estimating the mean
$\mu$ of a normal random variable, in that it gives $\hat\mu = \bar{x}$, the sample mean, which is the
minimum variance unbiased estimator of $\mu$, it is reasonable to consider this approach for
the estimation of $(\beta_0, \beta_1)$ in (3.4).

Now if

$$f(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right] \tag{3.5}$$

is the density of $Y_i$, then because $\{Y_i\}_{i=1}^{n}$ are assumed to be independent random variables,
the likelihood function L is given by

$$L = \prod_{i=1}^{n}\frac{1}{\sqrt{2\pi}\,\sigma}\exp\left[-\frac{1}{2\sigma^2}(y_i - \beta_0 - \beta_1 x_i)^2\right] = \left(\frac{1}{\sqrt{2\pi}\,\sigma}\right)^{n}\exp\left[-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2\right]. \tag{3.6}$$

As usual, the maximum likelihood estimates of $(\beta_0, \beta_1)$ are obtained by maximizing L
with respect to $(\beta_0, \beta_1)$. Since L is a "negative exponential", to maximize it it suffices
to minimize the exponent

$$\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 \tag{3.7}$$

in (3.6). Since $\sigma^2 > 0$, it suffices to minimize

$$S = \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2. \tag{3.8}$$

This can be done conveniently using calculus methods and will be pursued here. An
alternative algebraic proof is given in Section 3.2.1.
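
As a hedged aside, the step from the likelihood (3.6) to the sum of squares (3.8) can be made explicit by taking logarithms. The display below is only a sketch under the normality assumption already made; the symbol $\ell$ for the log-likelihood is our own notation and does not appear in the text.

```latex
\ell(\beta_0,\beta_1) \;=\; \log L
  \;=\; -\,n\log\!\left(\sqrt{2\pi}\,\sigma\right)
        \;-\; \frac{1}{2\sigma^{2}}\sum_{i=1}^{n}\left(y_i-\beta_0-\beta_1 x_i\right)^{2}
  \;=\; -\,n\log\!\left(\sqrt{2\pi}\,\sigma\right) \;-\; \frac{S}{2\sigma^{2}} .
```

For fixed $\sigma^2 > 0$ the first term is constant and $\ell$ is strictly decreasing in $S$, so maximizing $L$ over $(\beta_0, \beta_1)$ is the same as minimizing $S$.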
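Taking the partial derivatives of S with respect to $(\beta_0, \beta_1)$ and setting these derivatives
to zero enables us to find the minimizing values. From (3.8)

$$\frac{\partial S}{\partial \beta_0} = -2\sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i), \tag{3.9}$$

and

$$\frac{\partial S}{\partial \beta_1} = -2\sum_{i=1}^{n} x_i\,(y_i - \beta_0 - \beta_1 x_i). \tag{3.10}$$

Setting the right hand side of (3.9) to zero and rearranging gives

$$n\hat\beta_0 = \sum_{i=1}^{n} y_i - \hat\beta_1\sum_{i=1}^{n} x_i, \tag{3.11}$$

where the "carats" over $(\hat\beta_0, \hat\beta_1)$ indicate that these are the estimated values.
Solving (3.11) for $\hat\beta_0$ gives

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}, \tag{3.12}$$

where we have used the notation

$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i \quad\text{and}\quad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i. \tag{3.13}$$

To obtain an expression for $\hat\beta_1$ we set (3.10) to zero giving

$$\sum_{i=1}^{n} x_i y_i - \hat\beta_0\sum_{i=1}^{n} x_i - \hat\beta_1\sum_{i=1}^{n} x_i^2 = 0, \tag{3.14}$$

and substituting (3.12) for $\hat\beta_0$ in (3.14) gives

$$\sum_{i=1}^{n} x_i y_i - (\bar{y} - \hat\beta_1\bar{x})\,n\bar{x} - \hat\beta_1\sum_{i=1}^{n} x_i^2 = 0. \tag{3.15}$$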
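Thus,

$$\left(\overline{x^2} - \bar{x}^2\right)\hat\beta_1 = \overline{xy} - \bar{x}\,\bar{y}, \tag{3.16}$$

where

$$\overline{x^2} = \frac{1}{n}\sum_{i=1}^{n} x_i^2 \quad\text{and}\quad \overline{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i, \tag{3.17}$$

and solving (3.16) for $\hat\beta_1$ gives

$$\hat\beta_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2} = \frac{\sum_{i=1}^{n} x_i y_i - n\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}. \tag{3.18}$$

Having obtained $\hat\beta_1$ from (3.18) we may then calculate $\hat\beta_0$ from

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}, \tag{3.19}$$

with $\hat\beta_1$ given by (3.18).
Once $\hat\beta_0$ and $\hat\beta_1$ are determined from (3.18)-(3.19), we can obtain estimates of the y
values from the fitted line

$$\hat{y} = \hat\beta_0 + \hat\beta_1 x. \tag{3.20}$$

In particular, we can estimate $E(Y_i)$ by

$$\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i. \tag{3.21}$$

The difference

$$\hat{e}_i = y_i - \hat{y}_i \tag{3.22}$$

is called the i-th residual and its size is an indicator of how well the line $y = \hat\beta_0 + \hat\beta_1 x$
fits the observed data. The residuals may also be viewed as estimates of $\varepsilon_i$, and their
examination plays an important role in assessing the reasonableness of the assumptions
given for (3.1).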
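
The closed-form estimates and the residuals can be sketched in a few lines of Python. This is only an illustration: the function names `fit_line` and `residuals` and the small data set are hypothetical, not taken from the text. The final two sums check the two normal equations obtained by setting (3.9) and (3.10) to zero.

```python
def fit_line(x, y):
    """Return (b0, b1) from the averages formulas (3.18)-(3.19)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    xy_bar = sum(xi * yi for xi, yi in zip(x, y)) / n
    x2_bar = sum(xi * xi for xi in x) / n
    b1 = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)  # (3.18)
    b0 = y_bar - b1 * x_bar                                 # (3.19)
    return b0, b1


def residuals(x, y, b0, b1):
    """Residuals y_i - (b0 + b1 * x_i), as in (3.22)."""
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]


# Hypothetical data, chosen only for illustration.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.1, 2.9, 5.2, 6.8, 9.1]

b0, b1 = fit_line(x, y)
res = residuals(x, y, b0, b1)

# Both sums vanish (up to rounding) at the least squares solution.
sum_res = sum(res)
sum_x_res = sum(xi * ri for xi, ri in zip(x, res))
print(b0, b1, sum_res, sum_x_res)
```

The two vanishing sums are a convenient sanity check on any hand or machine computation of the fitted line.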

Before proceeding with the estimation of $\sigma^2$ we make several comments concerning
the MLEs of $(\beta_0, \beta_1)$. First, even if the errors are not normal, we can still estimate
$(\beta_0, \beta_1)$ by minimizing S, which is a sensible procedure if we consider the geometric
interpretation of this process. Suppose that $y = \hat\beta_0 + \hat\beta_1 x$ is the equation of some line
which we estimate to fit the "true line" $y = \beta_0 + \beta_1 x$. Then $y_i - (\hat\beta_0 + \hat\beta_1 x_i)$ represents
the vertical deviation of the point $(x_i, y_i)$ from this estimated line and

$$S = \sum_{i=1}^{n}\left(y_i - \hat\beta_0 - \hat\beta_1 x_i\right)^2 \tag{3.23}$$

is a measure of how well the estimated line fits the data. It is reasonable to choose as
a candidate for the best line one that makes (3.23) the smallest. If we choose $(\hat\beta_0, \hat\beta_1)$
by minimizing (3.23) to do this, we will arrive at (3.18)-(3.19) as our estimates of the
intercept $\beta_0$ and slope $\beta_1$, in agreement with the MLEs when the errors are normal. In
this case $(\hat\beta_0, \hat\beta_1)$ are called the least squares estimators and the resultant fitted line
$\hat{y} = \hat\beta_0 + \hat\beta_1 x$ the least squares line. Since many measures of goodness of fit other than
(3.23) are possible, one justification for using the least squares estimators for arbitrary
error distributions comes from the fact that when the errors are normal the least squares
and maximum likelihood estimators agree. For other error distributions they will not.
As an example, suppose that the density of $\varepsilon_i$ is given by

$$f_{\varepsilon_i}(\varepsilon) = \frac{1}{2}\exp(-|\varepsilon|). \tag{3.24}$$

In this case $\varepsilon_i$ has a Laplace or double exponential distribution. If the errors are independent
and identically distributed according to (3.24), then the density of $Y_i$ is given
by

$$f_{Y_i}(y_i) = \frac{1}{2}\exp\left(-|y_i - \beta_0 - \beta_1 x_i|\right) \tag{3.25}$$

and the likelihood function for n independent observations is given by

$$L = \frac{1}{2^n}\exp\left(-\sum_{i=1}^{n}|y_i - \beta_0 - \beta_1 x_i|\right). \tag{3.26}$$

In this case the MLE of $(\beta_0, \beta_1)$ is given by minimizing

$$\sum_{i=1}^{n}|y_i - \beta_0 - \beta_1 x_i| \tag{3.27}$$

and the values of $(\hat\beta_0, \hat\beta_1)$ minimizing (3.27) will in most circumstances differ from those
given by (3.18)-(3.19). As a numerical example, consider the three points (0,0), (1,0)
and (1/2,1/2) shown in Figure 3.1. The least squares line is easily found to be y = 1/6.
For this line the value of (3.27) is 2/3, so it does not minimize (3.27), since for the line
y = 0 the value is 1/2. In fact, the values $\hat\beta_0 = \hat\beta_1 = 0$ minimize (3.27) in this case.
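
The three-point example can be checked with exact arithmetic. The sketch below (the helper names `sum_sq` and `sum_abs` are our own) evaluates the sum of squared deviations and the sum of absolute deviations for the two candidate lines y = 1/6 and y = 0.

```python
from fractions import Fraction as F

# The three points of the numerical example.
points = [(F(0), F(0)), (F(1), F(0)), (F(1, 2), F(1, 2))]


def sum_sq(b0, b1):
    """Sum of squared vertical deviations from y = b0 + b1*x."""
    return sum((y - b0 - b1 * x) ** 2 for x, y in points)


def sum_abs(b0, b1):
    """Sum of absolute vertical deviations from y = b0 + b1*x."""
    return sum(abs(y - b0 - b1 * x) for x, y in points)


ls_line = (F(1, 6), F(0))   # least squares line y = 1/6
mad_line = (F(0), F(0))     # line y = 0

ls_sq, ls_abs = sum_sq(*ls_line), sum_abs(*ls_line)
mad_sq, mad_abs = sum_sq(*mad_line), sum_abs(*mad_line)
print(ls_sq, ls_abs)    # criteria for y = 1/6
print(mad_sq, mad_abs)  # criteria for y = 0
```

The line y = 1/6 has the smaller sum of squares (1/6 against 1/4), while y = 0 has the smaller sum of absolute deviations (1/2 against 2/3), so the two criteria select different lines for these data.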


Figure 3.1: Least squares and Minimum Absolute Deviations (MAD) lines


On the other hand the least squares estimator is the best linear unbiased estimator
(BLUE) (see Section 3.6), and this fact, the Gauss-Markov theorem, is often invoked to
justify its use even when the errors are not normal. For the most part in this text we
will concentrate on least squares estimation.

Although we have assumed that the x's are known exactly, this may not always be the
case. Measuring instruments are not 100% reliable and in practice values of both (x, y)
are often rounded to some convenient value. In such "errors in variables" (EIV) models
other estimators are often preferred to the least squares ones. A popular technique for
doing this is to minimize the sum of the squared perpendicular distances of the data points from
the line $y = \beta_0 + \beta_1 x$, with the coefficients $(\hat\beta_0, \hat\beta_1)$ now being determined by minimizing

$$\sum_{i=1}^{n} d_i^2, \tag{3.28}$$

where $d_i$, $1 \le i \le n$, are shown in Figure 3.2. This technique is called orthogonal least
squares in the statistics literature. This distance measure does not favor the x variable,
as does ordinary least squares, but rather treats both x and y variables equally. It is
also known as the method of total least squares in numerical analysis. It also needs to
be noted that ordinary least squares has some problems in EIV models [69, 13].


Figure 3.2: Distance minimized by orthogonal least squares


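As shown in Figure 3.2, for a particular data point $(x', y')$, the point $(\hat{x}, \hat{y})$ on a line
$y = \beta_0 + \beta_1 x$ that is closest when we measure distance perpendicularly is given by (see
Exercise 3.4)

$$\hat{x} = \frac{\beta_1 y' + x' - \beta_0\beta_1}{1 + \beta_1^2}, \qquad \hat{y} = \beta_0 + \frac{\beta_1}{1 + \beta_1^2}\left(\beta_1 y' + x' - \beta_0\beta_1\right). \tag{3.29}$$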
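
A short numerical check of the closest-point formula may be helpful. In the sketch below the function name `closest_point` and the particular line and data point are hypothetical choices for illustration; the check is that the segment from the data point to the computed point is perpendicular to the direction (1, b1) of the line.

```python
def closest_point(x0, y0, b0, b1):
    """Point on y = b0 + b1*x nearest to (x0, y0), following (3.29)."""
    x_hat = (b1 * y0 + x0 - b0 * b1) / (1 + b1 ** 2)
    y_hat = b0 + b1 * x_hat
    return x_hat, y_hat


# Hypothetical line y = 1 + 2x and data point (3, 2).
b0, b1 = 1.0, 2.0
x0, y0 = 3.0, 2.0
x_hat, y_hat = closest_point(x0, y0, b0, b1)

# Dot product of the displacement with the line direction (1, b1).
dot = (x0 - x_hat) * 1.0 + (y0 - y_hat) * b1
print(x_hat, y_hat, dot)
```

A zero dot product confirms that the distance is measured perpendicularly to the line rather than vertically, which is what distinguishes orthogonal from ordinary least squares.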


3.2.1 Algebraic Derivation of the Least Squares Estimators 


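For those readers who have not had calculus, or for those who wish to see a purely
algebraic derivation of the least squares estimators, we supply one here.
First consider

$$\begin{aligned} y_i - \beta_0 - \beta_1 x_i &= y_i - \bar{y} - \beta_0 - \beta_1(x_i - \bar{x}) + \bar{y} - \beta_1\bar{x} \\ &= (y_i - \bar{y}) - \beta_1(x_i - \bar{x}) + (\bar{y} - \beta_1\bar{x} - \beta_0), \end{aligned} \tag{3.30}$$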


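where $i = 1,2,\ldots,n$.
Squaring each of the n terms given by (3.30) and adding yields

$$\begin{aligned} \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 &= \sum_{i=1}^{n}(y_i - \bar{y})^2 + \beta_1^2\sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\beta_0 - \bar{y} + \beta_1\bar{x})^2 \\ &\quad - 2\beta_1\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}) - 2\sum_{i=1}^{n}(y_i - \bar{y})(\beta_0 - \bar{y} + \beta_1\bar{x}) \\ &\quad + 2\beta_1\sum_{i=1}^{n}(x_i - \bar{x})(\beta_0 - \bar{y} + \beta_1\bar{x}) \end{aligned} \tag{3.31a}$$

$$= \sum_{i=1}^{n}(y_i - \bar{y})^2 + \beta_1^2\sum_{i=1}^{n}(x_i - \bar{x})^2 + n(\beta_0 - \bar{y} + \beta_1\bar{x})^2 - 2\beta_1\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}), \tag{3.31b}$$

where the last two terms in (3.31a) are zero because $\sum_{i=1}^{n}(x_i - \bar{x}) = \sum_{i=1}^{n}(y_i - \bar{y}) = 0$.
Now consider the terms in (3.31b)

$$\beta_1^2\sum_{i=1}^{n}(x_i - \bar{x})^2 - 2\beta_1\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}). \tag{3.32}$$

Completing the square in (3.32) it becomes

$$\sum_{i=1}^{n}(x_i - \bar{x})^2\left[\beta_1 - \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right]^2 - \frac{\left[\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{3.33}$$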

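and using this in (3.31b) we get

$$\begin{aligned} \sum_{i=1}^{n}(y_i - \beta_0 - \beta_1 x_i)^2 &= \sum_{i=1}^{n}(y_i - \bar{y})^2 - \frac{\left[\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\right]^2}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \\ &\quad + n(\beta_0 - \bar{y} + \beta_1\bar{x})^2 + \sum_{i=1}^{n}(x_i - \bar{x})^2\left[\beta_1 - \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right]^2. \end{aligned} \tag{3.34}$$

Since the first two terms in (3.34) do not depend on $(\beta_0, \beta_1)$, the sum of squares of the
residuals will be minimized by setting the last two terms to zero, which gives

$$(\hat\beta_0 - \bar{y} + \hat\beta_1\bar{x})^2 = 0, \tag{3.35}$$

and

$$\left[\hat\beta_1 - \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}\right]^2 = 0. \tag{3.36}$$

Solving (3.35) and (3.36) gives

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x}, \tag{3.37}$$

and

$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \tag{3.38}$$

which agree with (3.18) and (3.19).
Note that we have obtained a little more with this argument than using calculus,
since (3.34) shows that $(\hat\beta_0, \hat\beta_1)$ actually minimize (3.8), rather than just providing a
critical point. We will expand on this approach in Chapter 5 when considering multiple
regression.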
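
The decomposition (3.34) is an identity in $(\beta_0, \beta_1)$, so it can be spot-checked numerically. The following sketch uses hypothetical data and a hypothetical helper `decomposition_terms`; it compares the left and right hand sides of (3.34) at several arbitrary coefficient pairs.

```python
def decomposition_terms(x, y, b0, b1):
    """Return (lhs, rhs) of the identity (3.34) at the pair (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    lhs = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    rhs = (syy - sxy ** 2 / sxx
           + n * (b0 - y_bar + b1 * x_bar) ** 2
           + sxx * (b1 - sxy / sxx) ** 2)
    return lhs, rhs


# Hypothetical data and arbitrary trial coefficients.
x = [1.0, 2.0, 4.0, 7.0]
y = [2.0, 3.0, 3.5, 8.0]
trials = [(0.0, 0.0), (1.0, 1.0), (-2.0, 0.5), (0.3, 1.7)]

checks = [decomposition_terms(x, y, b0, b1) for b0, b1 in trials]
max_gap = max(abs(lhs - rhs) for lhs, rhs in checks)
print(max_gap)
```

Because the last two terms of the right hand side are squares, the check also makes visible why no other pair can give a smaller sum of squares than the one solving (3.35) and (3.36).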

As we have already pointed out, most regression calculations of any significance usu- 
ally require a computer to do. However, in order to get a “feel” for the nature of the 
calculations and the basic concepts one should probably do at least some hand compu- 
tations. (Pain and suffering are good for the soul.) For this purpose, we have chosen as 
our first example a simple three point model of no particular importance. More problems 
with computer generated results will follow. 


Example 3.1 Find the MLEs for $(\beta_0, \beta_1)$ in (3.1) when $(x_i, y_i)$ have the values
(1,1), (2,4), (5,7) and the errors are assumed to be independent $N(0, \sigma^2)$.

From our previous discussion the MLEs of $(\beta_0, \beta_1)$ are given by (3.18)-(3.19). To
obtain these values we use a classical computing format illustrated in Table 3.1.


Table 3.1 Calculations

  x        y        x^2      xy
  1        1        1        1
  2        4        4        8
  5        7        25       35
  $\bar{x} = 8/3$   $\bar{y} = 4$   $\overline{x^2} = 10$   $\overline{xy} = 44/3$


In each column we list the values indicated: x-values of $x_i$; y-values of $y_i$; $x^2$-values
of $x_i^2$; and xy-values of $x_i y_i$. The last number in each column is the average of the
corresponding column values. The various averages are then used in (3.18)-(3.19) to give
$(\hat\beta_0, \hat\beta_1)$.

In this case we have

$$\hat\beta_1 = \frac{\overline{xy} - \bar{x}\,\bar{y}}{\overline{x^2} - \bar{x}^2} = \frac{44/3 - 32/3}{10 - 64/9} = \frac{36}{26} = 1.3846154 \tag{3.39}$$

and

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x} = 4 - \frac{96}{26} = 0.3076923. \tag{3.40}$$
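
The hand computation of Example 3.1 can be reproduced exactly with rational arithmetic. The sketch below simply applies (3.18)-(3.19) to the three data points; the variable names are our own.

```python
from fractions import Fraction as F

# Data of Example 3.1.
x = [F(1), F(2), F(5)]
y = [F(1), F(4), F(7)]
n = len(x)

# Column averages, as in Table 3.1.
x_bar = sum(x) / n
y_bar = sum(y) / n
x2_bar = sum(a * a for a in x) / n
xy_bar = sum(a * b for a, b in zip(x, y)) / n

b1 = (xy_bar - x_bar * y_bar) / (x2_bar - x_bar ** 2)  # (3.18)
b0 = y_bar - b1 * x_bar                                 # (3.19)
print(b1, float(b1))
print(b0, float(b0))
```

In lowest terms the slope is 18/13 and the intercept 4/13, which are the fractions 36/26 and 8/26 of the hand computation.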


Although (3.18) is easy to remember due to the symmetric appearance of $\overline{xy}$, $\bar{x}\,\bar{y}$ etc.
it is not always the best formula to use for numerical purposes.

For one thing, it is quite prone to round-off error and for machine calculations the
equivalent form (left as an exercise)

$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n}(x_i - \bar{x})^2} \tag{3.41}$$

is perhaps the better choice.
For theoretical purposes the expression

$$\hat\beta_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})\,y_i}{\sum_{i=1}^{n}(x_i - \bar{x})^2}, \tag{3.42}$$

obtained from (3.41) by using the fact that $\sum_{i=1}^{n}\bar{y}\,(x_i - \bar{x}) = 0$, is often preferred. The
following example illustrates the use of (3.42) numerically.


Example 3.2 An experiment was conducted to determine the relationship between
the percentage of a certain drug in the blood stream and the reaction time to a given
stimulus. The results are shown in Table 3.2.


Table 3.2 Drug and Reaction Time data

  Subject No.   Amount of drug x, %   Reaction time y, seconds
  1             1                     1
  2             2                     1
  3             3                     2
  4             4                     2
  5             5                     4


The scatter plot shown in Figure 3.3 suggests a possible linear relation between y and
x. Assuming that the model

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad i = 1,2,\ldots,5,$$

where $\{\varepsilon_i\}$ are independent $N(0, \sigma^2)$ random variables, is appropriate, find the MLEs
of $(\beta_0, \beta_1)$.


Figure 3.3: Scatter plot for drug and reaction time data


Since the $\varepsilon_i$'s are normal, the MLEs of $(\beta_0, \beta_1)$ are the least squares estimates. To
obtain these we use the following tabular set-up similar to that in Example 3.1.


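Table 3.3 Calculations for Drug and Reaction Time Data

  x     y     $x - \bar{x}$     $(x - \bar{x})^2$     $y(x - \bar{x})$
  1     1     -2          4             -2
  2     1     -1          1             -1
  3     2      0          0              0
  4     2      1          1              2
  5     4      2          4              8

  $\bar{x} = 3$   $\bar{y} = 2$   $\sum(x - \bar{x}) = 0$   $\sum(x - \bar{x})^2 = 10$   $\sum y(x - \bar{x}) = 7$


From this we find that

$$\hat\beta_1 = \frac{\sum y_i(x_i - \bar{x})}{\sum(x_i - \bar{x})^2} = \frac{7}{10} = 0.7$$

and

$$\hat\beta_0 = \bar{y} - \hat\beta_1\bar{x} = 2 - (0.7)(3) = -0.1.$$

The fitted line is then

$$\hat{y} = -0.1 + 0.7x.$$

A graph of this line superimposed on the scatter plot is shown in Figure 3.4.


Figure 3.4: Fitted line for drug and reaction time data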

We will now look at several more complicated examples of simple linear regression
models. At this point we caution the reader to take our model assumptions concerning
empirical data with a grain of salt. In general, one can never prove conclusively that

the assumptions used to fit some data are ‘true’. The best one can hope to do is to 
make a fit given our assumptions, and then use a variety of statistical techniques, to 
see if our results appear to be consistent with these assumptions. If we feel they are, 
then we can entertain the hypothesized model as reasonable, otherwise we must modify 
our assumptions, refit and continue until we are satisfied. This process is only partly 
mathematical and different analysts may come to different conclusions for the same data. 
But we must start somewhere and (3.1) is often a good choice.
Example 3.3 To see what (3.18)-(3.19) will do when (3.1) is known to be true, we
consider the following example taken from [40].

There, five independent observations were made of a N(0, 1) random variable and the
values added to the five points (1,8), (2,11), (3,14), (4,17), (5,20) on the line y = 5 + 3x.
These observations gave rise to the data in Table 3.4.


Table 3.4

  x     y
  1     7.695
  2     10.679
  3     15.900
  4     16.222
  5     20.617


In this case the reader can verify that $\hat\beta_0 = 4.81$ and $\hat\beta_1 = 3.14$, showing that the
least squares line provides a reasonable estimate of the true line, even for a small sample
size.

For visual comparison the true line, estimated line and observed y values are shown 
in Figure 3.5. 


Figure 3.5: Comparison of true and fitted line for Example 3.3
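
The verification left to the reader in Example 3.3 takes only a few lines. The sketch below applies the centered formula (3.42) and then (3.19) to the data of Table 3.4; the variable names are our own.

```python
# Data of Table 3.4: points on y = 5 + 3x plus N(0, 1) noise.
x = [1, 2, 3, 4, 5]
y = [7.695, 10.679, 15.900, 16.222, 20.617]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx  # (3.42)
b0 = y_bar - b1 * x_bar                                     # (3.19)

print(round(b0, 2), round(b1, 2))
```

To two decimals this reproduces the values 4.81 and 3.14 quoted in the example, compared with the true intercept 5 and slope 3.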


Example 3.4 (Drink delivery data) To illustrate the further use of regression analysis
in developing an empirical model, consider the following problem taken from [87].

A drink manufacturer is interested in determining whether a linear relationship between
the time y (in minutes) it takes to deliver an order and the number of cases
delivered, x, is reasonable. For this he has available the data in Table 3.5. A scatter plot
of these data is shown in Figure 3.6 and, perhaps with the exception of the 9th observation,
the points appear to fall along a straight line. Thus if (3.1) is the true model (we
don't know if it is), then $\beta_0$ and $\beta_1$ may be estimated by least squares.
Table 3.5 The Number of Cases x and Delivery Time y
Cases   Delivery time      Cases   Delivery time


Since the calculations here are tedious, values of $\hat\beta_0$ and $\hat\beta_1$ were obtained (by using
MINITAB or S-PLUS). They are

$\hat\beta_0 = 3.207798$, $\hat\beta_1 = 2.1761668$

and the estimated line is shown in Figure 3.7 superimposed over the scatter plot. Without
any formal assessment, the fit appears to be good and so at this point (3.1) seems to be
a reasonable model of the given data.


Figure 3.6: Scatter plot for delivery data


Figure 3.7: Least squares line for delivery data


Example 3.5 (Tractor Data) The cost of maintaining a tractor in a given year
appears to increase linearly with the age of the tractor. The following data were collected
to examine this hypothesis. If a linear model is appropriate, find the least squares
estimates of $(\beta_0, \beta_1)$.


Table 3.6 Ages of Tractors and Maintenance Cost
Obs.   Age         Cost      Age         Cost
No.    x (years)   y ($)     x (years)   y ($)


Using (3.18)-(3.19) we find that
$\hat\beta_0 = 323.622$ and $\hat\beta_1 = 131.716$


and so the fitted least squares line is
$\hat{y} = 323.622 + 131.716x$.


This line suggests that maintenance costs appear to increase at a rate of about $131.72 
per year. 


Example 3.6 (Birth weight data) It is of interest to determine how the weight of a
newborn baby depends on the length of the gestation period. To obtain a relationship
the weights of 24 babies were measured (in grams) and the length of the corresponding
gestation period (in weeks) was recorded. A scatter plot of the data is shown in Figure
3.8. A linear trend is observed so a line was fitted by least squares. (In fact, the plot
looks more like two parallel lines.) The values of $(\hat\beta_0, \hat\beta_1)$ are given by
$\hat\beta_0 = -1393$ and $\hat\beta_1 = 113.20$.


The result of this calculation indicates that the weight of a baby increases by about 
113 grams for each week in the womb. 


Table 3.7 Gestation Period and Weights of Newborn Babies
Obs. No. Age (x) Weight (y) | Obs. No. Age (x) Weight (y)


1 40 2968 13 40 3317 
2 38 2795 14 36 2729 
3 40 3163 15 40 2935 
4 35 2975 16 38 2754 
5 36 2625 17 42 3210 
6 37 2847 18 39 2817 
7 41 3292 19 40 3126 
8 40 3473 20 37 2539 
9 37 2628 21 36 2412 
10 38 3176 22 38 2991 
11 40 3421 23 39 2875 
12 38 2975 24 40 3231 
Figure 3.8: Scatter plot of weight and weeks for birth weight data


After we found the estimated linear regression model between the gestation period
and weights of the twenty-four newborn babies, $\hat{y} = -1393 + 113.20x$, we calculated the
residuals, $\hat{e}_i = y_i - \hat{y}_i$, $i = 1,2,\ldots,24$, and plotted them in Figure 3.9.

As we shall see, the residual plot provides us with important clues for checking the
assumptions about the postulated model. We will discuss this in more detail in Section
3.9.


Figure 3.9: Plot of residuals $\hat{e} = y - \hat{y}$ for birth weight data
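
For readers who wish to reproduce the fit and the residuals of Example 3.6, the following sketch enters the 24 pairs of Table 3.7 and applies (3.41) and (3.19). The variable names are our own, and no plotting library is assumed; the residuals are simply computed and listed.

```python
# Table 3.7: gestation period (weeks) and birth weight (grams).
age = [40, 38, 40, 35, 36, 37, 41, 40, 37, 38, 40, 38,
       40, 36, 40, 38, 42, 39, 40, 37, 36, 38, 39, 40]
weight = [2968, 2795, 3163, 2975, 2625, 2847, 3292, 3473, 2628, 3176, 3421, 2975,
          3317, 2729, 2935, 2754, 3210, 2817, 3126, 2539, 2412, 2991, 2875, 3231]

n = len(age)
x_bar = sum(age) / n
y_bar = sum(weight) / n

sxx = sum((xi - x_bar) ** 2 for xi in age)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(age, weight))
b1 = sxy / sxx            # slope, grams per week
b0 = y_bar - b1 * x_bar   # intercept

# Residuals e_i = y_i - (b0 + b1 * x_i), the quantities plotted in Figure 3.9.
resid = [yi - (b0 + b1 * xi) for xi, yi in zip(age, weight)]
resid_sum = sum(resid)

print(round(b0), round(b1, 2))
for xi, ri in zip(age, resid):
    print(xi, round(ri, 1))
```

The residuals of a least squares fit with an intercept always sum to zero, so what matters in a residual plot is the pattern of the residuals against x, not their average level.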


3.3 Estimating $\sigma^2$
As for $(\beta_0, \beta_1)$, the model error variance $\sigma^2$ in (3.1) can be estimated by maximum likelihood.
This can be obtained by differentiating L with respect to $\sigma^2$. The result of this
calculation (see Exercise 3.5) shows that

$$\hat\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \tag{3.43}$$

where $\hat{y}_i = \hat\beta_0 + \hat\beta_1 x_i$. Thus the MLE of $\sigma^2$ is given by taking the average of the
squares of the residuals; an intuitive choice. By taking the square root of $\hat\sigma^2_{MLE}$ we
obtain the MLE of $\sigma$,

$$\hat\sigma_{MLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}. \tag{3.44}$$

In this case the MLE is not the customary estimate of $\sigma^2$; rather the usual estimate
of $\sigma^2$ is given by

$$s^2 = \frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2 \tag{3.45}$$

and $\sigma$ is then estimated by

$$s = \sqrt{\frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}, \tag{3.46}$$

which is usually called the standard error of regression.
The reason for choosing $s^2$ rather than $\hat\sigma^2_{MLE}$ to estimate $\sigma^2$ is that $s^2$ provides an
unbiased estimate of $\sigma^2$, regardless of the distribution of $\varepsilon_i$, $1 \le i \le n$, whereas $\hat\sigma^2_{MLE}$
is biased, even for normal errors. (However, s is a biased estimate of $\sigma$.) On average,
$\hat\sigma^2_{MLE}$ underestimates $\sigma^2$.

Using n − 2 in the denominator rather than n is analogous to dividing by n − 1
rather than n in the usual estimate of $\sigma^2$ for random samples. In each instance one loses
a "number of degrees of freedom" equal to the number of estimated parameters. The
unbiasedness of $s^2$ will be demonstrated in the following section.
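
Before the formal demonstration, a small simulation can make the bias statements plausible. The sketch below is only illustrative: the design points, the true line y = 2 + 0.5x, the value of sigma, the seed and the function name `simulate` are all assumptions of ours. It averages $s^2$, $\hat\sigma^2_{MLE}$ and $s$ over many simulated samples with normal errors.

```python
import random
from math import sqrt


def simulate(n_rep=20000, sigma=1.0, seed=0):
    """Average s^2, the MLE of sigma^2, and s over n_rep simulated samples."""
    rng = random.Random(seed)
    x = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical design points
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    tot_s2 = tot_mle = tot_s = 0.0
    for _ in range(n_rep):
        # Hypothetical true line y = 2 + 0.5x with N(0, sigma^2) errors.
        y = [2.0 + 0.5 * xi + rng.gauss(0.0, sigma) for xi in x]
        y_bar = sum(y) / n
        b1 = sum((xi - x_bar) * yi for xi, yi in zip(x, y)) / sxx
        b0 = y_bar - b1 * x_bar
        sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
        tot_s2 += sse / (n - 2)            # (3.45)
        tot_mle += sse / n                 # (3.43)
        tot_s += sqrt(sse / (n - 2))       # (3.46)
    return tot_s2 / n_rep, tot_mle / n_rep, tot_s / n_rep


mean_s2, mean_mle, mean_s = simulate()
print(mean_s2, mean_mle, mean_s)
```

With n = 5 and $\sigma^2 = 1$ the average of $s^2$ should be close to 1 and that of the MLE close to (n − 2)/n = 0.6, while the average of $s$ falls below 1, in line with the remarks above.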

Although most regression calculations are done using calculators and/or computers it 
is possible to calculate s? without forming all the residuals 2; = y; — Bo — B 12,1<%< тп. 
This can be done according to the formula 


52 = = : 7 z mam =й, У (т;— = (3.47) 


i=1 


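This usually results in somewhat less computation than using (3.45) directly. To obtain
(3.47), substitute $\hat{\beta}_0 = \bar{y} - \hat{\beta}_1\bar{x}$ and expand:

$$\begin{aligned}
\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2
&= \sum_{i=1}^{n}\left[y_i - \bar{y} + \hat{\beta}_1\bar{x} - \hat{\beta}_1 x_i\right]^2 \\
&= \sum_{i=1}^{n}\left[\left(y_i - \bar{y}\right) - \hat{\beta}_1\left(x_i - \bar{x}\right)\right]^2 \\
&= \sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 - 2\hat{\beta}_1\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right) + \hat{\beta}_1^2\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2.
\end{aligned} \tag{3.48}$$

Using

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)\left(y_i - \bar{y}\right)}{\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2}, \tag{3.49}$$

(3.48) becomes

$$\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2 - \hat{\beta}_1^2\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2. \tag{3.50}$$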


As a further matter of notation, the numerator in the expression for $s^2$ will often be
written as $SSE$, short for "sum of the squares of the errors". Then

$$s^2 = SSE/(n - 2). \tag{3.51}$$


Example 3.7 Assuming the errors in Example 3.3 have constant variance $\sigma^2$, find
an estimate of $\sigma^2$ by using $s^2$. Here, to illustrate (3.45) and (3.47), we will calculate $s^2$ in
two ways.

First, using (3.45), $s^2 = SSE/(n - 2)$, so we need to obtain $SSE$. The relevant
calculations are given below.

    y      y-hat    y - y-hat    (y - y-hat)^2
    1      0.6       0.4         0.16
    1      1.3      -0.3         0.09
    2      2.0       0           0
    2      2.7      -0.7         0.49
    4      3.4       0.6         0.36

$$SSE = 0.16 + 0.09 + 0 + 0.49 + 0.36 = 1.1.$$

Thus,

$$s^2 = 1.1/3 = 0.3667 \quad\text{and}\quad s = 0.6055.$$

To use (3.47), first we need to calculate

$$\sum_{i=1}^{5}\left(y_i - \bar{y}\right)^2 = (-1)^2 + (-1)^2 + (0)^2 + (0)^2 + (2)^2 = 6.$$

Then,

$$s^2 = \frac{1}{3}\left[\sum_{i=1}^{5}\left(y_i - \bar{y}\right)^2 - \hat{\beta}_1^2\sum_{i=1}^{5}\left(x_i - \bar{x}\right)^2\right] = \frac{1}{3}\left[6 - (0.7)^2 \cdot 10\right] = 0.3667$$

and $s = 0.6055$ as before.
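
The two routes to $s^2$ in Example 3.7 can be checked numerically. The following is a minimal sketch, not part of the original text: the design points $x = 1, \ldots, 5$ are not listed in this section and are inferred from the fitted values $\hat{y} = -0.1 + 0.7x$, and the function name `s2_two_ways` is chosen here for illustration.

```python
import math

def s2_two_ways(x, y):
    """Return s^2 computed from the residuals (3.45) and from the shortcut (3.47)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    syy = sum((yi - ybar) ** 2 for yi in y)
    b1 = sxy / sxx               # least squares slope
    b0 = ybar - b1 * xbar        # least squares intercept
    # (3.45): sum of squared residuals divided by n - 2
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    direct = sse / (n - 2)
    # (3.47): no residuals are formed
    shortcut = (syy - b1 ** 2 * sxx) / (n - 2)
    return direct, shortcut

x = [1, 2, 3, 4, 5]   # assumed design points (inferred, not stated in this section)
y = [1, 1, 2, 2, 4]
direct, shortcut = s2_two_ways(x, y)
print(round(direct, 4), round(shortcut, 4), round(math.sqrt(direct), 4))
```

Both values agree up to floating-point rounding, which is the content of the identity (3.48)-(3.50).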


Example 3.8 Calculate the standard error of regression for each of Examples 3.3, 
3.4 and 3.5. 

Here, because the calculations are tedious we do the calculations using MINITAB or 
S-PLUS. The results are: 


(1) for Example 3.3, $s = 1.204$;
(2) for Example 3.4, $s = 4.181$;
(3) for Example 3.5, $s = 283.479$.


We noted above that the reason usually given for using $s^2$ rather than the MLE of
$\sigma^2$ is that $s^2$ is unbiased, while $\hat{\sigma}^2_{MLE}$ is not. However, $s$ is generally a biased estimate
of $\sigma$ (as is $\hat{\sigma}_{MLE}$). As we now show, this is a general property of unbiased estimators of
the variance.

Let $s^2$ be an unbiased estimate of $\sigma^2$. Then $E(s) \le \sigma$. To see this, note that for any
random variable $X$ with $\operatorname{Var}(X) < \infty$,

$$E(X^2) = \operatorname{Var}(X) + [E(X)]^2. \tag{3.52}$$

Taking $X = s$ in this equation gives

$$E(s^2) = \operatorname{Var}(s) + [E(s)]^2. \tag{3.53}$$

But $E(s^2) = \sigma^2$, so that

$$\sigma^2 \ge [E(s)]^2, \tag{3.54}$$

or $E(s) \le \sigma$, with equality holding only if $\operatorname{Var}(s) = 0$, that is, only if $s$ is constant with
probability one. For normal random variables (see Chapter 2), $s^2$ is proportional to a $\chi^2$ random
variable (with $n - 2$ degrees of freedom in the regression setting, as noted at the end of Section 3.4),
which is not constant. Hence, for normal random variables, $E(s) < \sigma$.

3.4 Properties of $(\hat{\beta}_0, \hat{\beta}_1, s^2)$

We now consider some properties of $(\hat{\beta}_0, \hat{\beta}_1)$ and of the estimator of variance $s^2$. These
properties will be important in developing confidence intervals and significance tests for
the parameters.


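Theorem 3.1 Let $(\hat{\beta}_0, \hat{\beta}_1)$ be the least squares estimators of $(\beta_0, \beta_1)$ in the linear
model

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad 1 \le i \le n, \tag{3.55}$$

where $\varepsilon_i$, $1 \le i \le n$, are independent random variables (it is sufficient that $\varepsilon_i$, $1 \le i \le n$,
be uncorrelated) with common variance $\sigma^2$. Then,

(i) $E(\hat{\beta}_1) = \beta_1$ ($\hat{\beta}_1$ is unbiased);

(ii) $E(\hat{\beta}_0) = \beta_0$ ($\hat{\beta}_0$ is unbiased);

(iii) $\operatorname{Var}(\hat{\beta}_1) = \sigma^2/S_{xx}$, where $S_{xx} = \sum_{i=1}^{n}(x_i - \bar{x})^2$;
and

(iv) $\operatorname{Var}(\hat{\beta}_0) = \sigma^2\left[1/n + \bar{x}^2/S_{xx}\right]$.

(v) In addition, if $\varepsilon_i$, $1 \le i \le n$, are $N(0, \sigma^2)$, then

$$\hat{\beta}_i \sim N\left[\beta_i, \operatorname{Var}(\hat{\beta}_i)\right], \quad i = 0, 1. \tag{3.56}$$

(Note: Since the proofs of Theorem 3.1 and Theorem 3.2 are a little long, some
readers may wish to skip them at this point. It is advisable, however, to know the
properties stated in these theorems as they will be used repeatedly in the remainder
of the chapter.)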


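Proof. (i) From (3.18),

$$\begin{aligned}
E(\hat{\beta}_1) &= E\left[\frac{1}{S_{xx}}\sum_{i=1}^{n}(x_i - \bar{x})Y_i\right] = \frac{1}{S_{xx}}\sum_{i=1}^{n}(x_i - \bar{x})E(Y_i) \\
&= \frac{1}{S_{xx}}\sum_{i=1}^{n}(x_i - \bar{x})(\beta_0 + \beta_1 x_i) \\
&= \frac{\beta_0}{S_{xx}}\sum_{i=1}^{n}(x_i - \bar{x}) + \frac{\beta_1}{S_{xx}}\sum_{i=1}^{n}(x_i - \bar{x})x_i \\
&= \beta_1,
\end{aligned} \tag{3.57}$$

since $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$ and $\sum_{i=1}^{n}(x_i - \bar{x})x_i = S_{xx}$.

(ii) From (3.19) and (i),

$$\begin{aligned}
E(\hat{\beta}_0) &= E(\bar{Y} - \hat{\beta}_1\bar{x}) = E(\bar{Y}) - \bar{x}E(\hat{\beta}_1) \\
&= \frac{1}{n}\sum_{i=1}^{n}(\beta_0 + \beta_1 x_i) - \beta_1\bar{x} \\
&= \beta_0 + \beta_1\bar{x} - \beta_1\bar{x} = \beta_0.
\end{aligned} \tag{3.58}$$

(iii) Again using (3.18) and the independence of $\{Y_i\}$, we have

$$\begin{aligned}
\operatorname{Var}(\hat{\beta}_1) &= \operatorname{Var}\left[\frac{1}{S_{xx}}\sum_{i=1}^{n}(x_i - \bar{x})Y_i\right]
= \frac{1}{S_{xx}^2}\sum_{i=1}^{n}(x_i - \bar{x})^2\operatorname{Var}(Y_i) \\
&= \frac{\sigma^2 S_{xx}}{S_{xx}^2} = \frac{\sigma^2}{S_{xx}}.
\end{aligned} \tag{3.59}$$

(iv) From (3.19), $\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{x}$, so that

$$\begin{aligned}
\operatorname{Var}(\hat{\beta}_0) &= \operatorname{Var}(\bar{Y} - \hat{\beta}_1\bar{x}) \\
&= \operatorname{Var}(\bar{Y}) + \bar{x}^2\operatorname{Var}(\hat{\beta}_1) - 2\bar{x}\operatorname{Cov}(\bar{Y}, \hat{\beta}_1) \\
&= \frac{\sigma^2}{n} + \frac{\bar{x}^2\sigma^2}{S_{xx}} - 2\bar{x}\operatorname{Cov}(\bar{Y}, \hat{\beta}_1).
\end{aligned}$$

Hence, to complete the proof of (iv) we need to show that $\operatorname{Cov}(\bar{Y}, \hat{\beta}_1) = 0$.
Now using (3.18),

$$\operatorname{Cov}(\bar{Y}, \hat{\beta}_1) = \operatorname{Cov}\left[\bar{Y}, \frac{1}{S_{xx}}\sum_{i=1}^{n}(x_i - \bar{x})Y_i\right]
= \frac{1}{S_{xx}}\sum_{i=1}^{n}(x_i - \bar{x})\operatorname{Cov}(\bar{Y}, Y_i). \tag{3.60}$$

Since $Y_i$, $1 \le i \le n$, are independent, $\operatorname{Cov}(Y_i, Y_j) = 0$, $i \ne j$, so that

$$\operatorname{Cov}(\bar{Y}, Y_i) = \operatorname{Cov}\left(\frac{1}{n}\sum_{j=1}^{n}Y_j, Y_i\right) = \frac{1}{n}\sum_{j=1}^{n}\operatorname{Cov}(Y_j, Y_i)
= \frac{1}{n}\operatorname{Cov}(Y_i, Y_i) = \frac{\sigma^2}{n}, \tag{3.61}$$

and substituting (3.61) into (3.60) gives

$$\operatorname{Cov}(\bar{Y}, \hat{\beta}_1) = \frac{\sigma^2}{nS_{xx}}\sum_{i=1}^{n}(x_i - \bar{x}) = 0, \tag{3.62}$$

so that

$$\operatorname{Var}(\hat{\beta}_0) = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right], \tag{3.63}$$

as required.

(v) Using (3.18) and (3.19) one easily finds that $\hat{\beta}_0$ and $\hat{\beta}_1$ are each a linear combination
of the independent normal random variables $Y_i$. From Section 2.8 such a linear combination is also normal,
with the means and variances given by (2.101)-(2.102). $\blacksquare$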
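
The variance formulas in Theorem 3.1 can be checked exactly by writing each estimator as a linear combination of the $Y_i$. The sketch below is illustrative: the design points, the value of $\sigma^2$ and the helper name `ls_weights` are assumptions made for the example. It uses only the fact that, for uncorrelated $Y_i$ with common variance $\sigma^2$, $\operatorname{Var}(\sum a_i Y_i) = \sigma^2\sum a_i^2$.

```python
def ls_weights(x):
    """Weights c_i, d_i with beta1_hat = sum(c_i * Y_i) and beta0_hat = sum(d_i * Y_i)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    c = [(xi - xbar) / sxx for xi in x]      # slope weights, from (3.18)
    d = [1.0 / n - xbar * ci for ci in c]    # intercept weights, from (3.19)
    return c, d, xbar, sxx

x = [1, 2, 3, 4, 5]   # illustrative design points (an assumption)
sigma2 = 1.0          # illustrative error variance (an assumption)
c, d, xbar, sxx = ls_weights(x)
n = len(x)

# Variances and covariance of the linear combinations
var_b1 = sigma2 * sum(ci ** 2 for ci in c)
var_b0 = sigma2 * sum(di ** 2 for di in d)
cov_b0_b1 = sigma2 * sum(ci * di for ci, di in zip(c, d))

# Each pair printed below should agree up to rounding
print(var_b1, sigma2 / sxx)                          # Theorem 3.1 (iii)
print(var_b0, sigma2 * (1 / n + xbar ** 2 / sxx))    # Theorem 3.1 (iv)
print(cov_b0_b1, -xbar * sigma2 / sxx)               # covariance of the two estimators
```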


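Theorem 3.2 Under the assumptions of Theorem 3.1, $E(s^2) = \sigma^2$. That is, $s^2$ is
an unbiased estimator of $\sigma^2$.

Proof. Since $E(\hat{Y}_i) = E(\hat{\beta}_0 + \hat{\beta}_1 x_i) = \beta_0 + \beta_1 x_i = E(Y_i)$,

$$E\left[(Y_i - \hat{Y}_i)^2\right] = \operatorname{Var}(Y_i - \hat{Y}_i). \tag{3.64}$$

Thus,

$$\begin{aligned}
E\left[\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2\right] &= \sum_{i=1}^{n}\operatorname{Var}(Y_i - \hat{Y}_i) \\
&= \sum_{i=1}^{n}\operatorname{Var}(Y_i) + \sum_{i=1}^{n}\operatorname{Var}(\hat{Y}_i) - 2\sum_{i=1}^{n}\operatorname{Cov}(Y_i, \hat{Y}_i) \\
&= n\sigma^2 + \sum_{i=1}^{n}\operatorname{Var}(\hat{Y}_i) - 2\sum_{i=1}^{n}\operatorname{Cov}(Y_i, \hat{Y}_i).
\end{aligned} \tag{3.65}$$

We now consider $\operatorname{Var}(\hat{Y}_i)$, which is given by

$$\begin{aligned}
\operatorname{Var}(\hat{Y}_i) &= \operatorname{Var}(\hat{\beta}_0 + \hat{\beta}_1 x_i) \\
&= \operatorname{Var}(\hat{\beta}_0) + x_i^2\operatorname{Var}(\hat{\beta}_1) + 2x_i\operatorname{Cov}(\hat{\beta}_0, \hat{\beta}_1) \\
&= \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right] + \frac{x_i^2\sigma^2}{S_{xx}} + 2x_i\operatorname{Cov}(\hat{\beta}_0, \hat{\beta}_1).
\end{aligned} \tag{3.66}$$

But,

$$\begin{aligned}
\operatorname{Cov}(\hat{\beta}_0, \hat{\beta}_1) &= \operatorname{Cov}(\bar{Y} - \hat{\beta}_1\bar{x}, \hat{\beta}_1) \\
&= \operatorname{Cov}(\bar{Y}, \hat{\beta}_1) - \bar{x}\operatorname{Cov}(\hat{\beta}_1, \hat{\beta}_1) \\
&= \operatorname{Cov}(\bar{Y}, \hat{\beta}_1) - \bar{x}\operatorname{Var}(\hat{\beta}_1),
\end{aligned} \tag{3.67}$$

and using (3.62) and (iii) of Theorem 3.1 we get

$$\operatorname{Cov}(\hat{\beta}_0, \hat{\beta}_1) = -\frac{\bar{x}\sigma^2}{S_{xx}}. \tag{3.68}$$

Hence,

$$\operatorname{Var}(\hat{Y}_i) = \sigma^2\left[\frac{1}{n} + \frac{\bar{x}^2 + x_i^2 - 2x_i\bar{x}}{S_{xx}}\right]
= \sigma^2\left[\frac{1}{n} + \frac{(x_i - \bar{x})^2}{S_{xx}}\right]. \tag{3.69}$$

Thus, the second term in (3.65) becomes

$$\sum_{i=1}^{n}\operatorname{Var}(\hat{Y}_i) = \sigma^2 + \frac{\sigma^2}{S_{xx}}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \sigma^2 + \sigma^2 = 2\sigma^2. \tag{3.70}$$

Finally, we calculate $\sum_{i=1}^{n}\operatorname{Cov}(Y_i, \hat{Y}_i)$. For this we have

$$\operatorname{Cov}(Y_i, \hat{Y}_i) = \operatorname{Cov}(Y_i, \hat{\beta}_0 + \hat{\beta}_1 x_i)
= \operatorname{Cov}(Y_i, \hat{\beta}_0) + x_i\operatorname{Cov}(Y_i, \hat{\beta}_1). \tag{3.71}$$

But,

$$\operatorname{Cov}(Y_i, \hat{\beta}_1) = \frac{1}{S_{xx}}\sum_{j=1}^{n}(x_j - \bar{x})\operatorname{Cov}(Y_i, Y_j) = \frac{\sigma^2(x_i - \bar{x})}{S_{xx}} \tag{3.72}$$

and

$$\begin{aligned}
\operatorname{Cov}(Y_i, \hat{\beta}_0) &= \operatorname{Cov}(Y_i, \bar{Y} - \hat{\beta}_1\bar{x}) \\
&= \operatorname{Cov}(Y_i, \bar{Y}) - \bar{x}\operatorname{Cov}(Y_i, \hat{\beta}_1) \\
&= \frac{\sigma^2}{n} - \frac{\bar{x}\sigma^2(x_i - \bar{x})}{S_{xx}},
\end{aligned} \tag{3.73}$$

so that

$$\operatorname{Cov}(Y_i, \hat{Y}_i) = \frac{\sigma^2}{n} - \frac{\bar{x}\sigma^2(x_i - \bar{x})}{S_{xx}} + \frac{x_i\sigma^2(x_i - \bar{x})}{S_{xx}}
= \frac{\sigma^2}{n} + \frac{\sigma^2(x_i - \bar{x})^2}{S_{xx}}. \tag{3.74}$$

Thus,

$$\sum_{i=1}^{n}\operatorname{Cov}(Y_i, \hat{Y}_i) = \sigma^2 + \frac{\sigma^2}{S_{xx}}\sum_{i=1}^{n}(x_i - \bar{x})^2 = 2\sigma^2. \tag{3.75}$$

Using (3.70) and (3.75) in (3.65) gives

$$E\left[\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2\right] = n\sigma^2 + 2\sigma^2 - 4\sigma^2 = (n - 2)\sigma^2, \tag{3.76}$$

so that

$$E(s^2) = \frac{1}{n-2}E\left[\sum_{i=1}^{n}(Y_i - \hat{Y}_i)^2\right] = \frac{\sigma^2}{n-2}(n - 2) = \sigma^2. \tag{3.77}$$

$\blacksquare$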
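
The bookkeeping in this proof can be verified for any particular design. The following sketch is an illustration under assumed inputs (the design points and the value of $\sigma^2$ are arbitrary, and `expected_sse` is a name introduced here): it assembles the three sums in (3.65) from (3.69) and (3.74) and compares the total with $(n - 2)\sigma^2$.

```python
def expected_sse(x, sigma2):
    """Assemble E[sum (Y_i - Yhat_i)^2] from the three sums in (3.65)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    # By (3.69) and (3.74), Var(Yhat_i) and Cov(Y_i, Yhat_i) both equal sigma2 * h_i
    h = [1.0 / n + (xi - xbar) ** 2 / sxx for xi in x]
    sum_var_y = n * sigma2
    sum_var_yhat = sigma2 * sum(h)     # (3.70): equals 2 * sigma2
    sum_cov = sigma2 * sum(h)          # (3.75): equals 2 * sigma2
    return sum_var_y + sum_var_yhat - 2.0 * sum_cov

x = [0.5, 1.0, 2.5, 4.0, 7.0, 7.5]    # arbitrary illustrative design points
print(expected_sse(x, sigma2=2.0), (len(x) - 2) * 2.0)
```

The result does not depend on where the $x_i$ are placed, only on $n$, because the weights $h_i$ always sum to 2.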


Last, we observe that if the errors $\{\varepsilon_i\}$ are independent and $N(0, \sigma^2)$, then the
distribution of $(n - 2)s^2/\sigma^2$ is $\chi^2$ with $n - 2$ degrees of freedom and, surprisingly, $s^2$ is
independent of $\hat{\beta}_0$ and $\hat{\beta}_1$. These results are important, as we shall see, in developing
test procedures for the simple linear regression model. The proofs of these latter two
properties are somewhat technical, and will be obtained as a consequence of a more
general argument for the multiple regression model in Chapter 5. Again, these facts should
be remembered, even if the proofs are omitted.


3.4.1 Standard Errors of the Coefficients

Once we have estimated $\sigma^2$, we can obtain point estimates of the variances and
standard deviations of $(\hat{\beta}_0, \hat{\beta}_1)$. From Theorem 3.1 we find that if the errors are
uncorrelated and have common variance $\sigma^2$, then
$\operatorname{Var}(\hat{\beta}_0) = \sigma^2(\hat{\beta}_0) = \sigma^2\left[1/n + \bar{x}^2/S_{xx}\right]$
and $\operatorname{Var}(\hat{\beta}_1) = \sigma^2(\hat{\beta}_1) = \sigma^2/S_{xx}$. Thus, unbiased estimates of these quantities are
given by

$$\hat{\sigma}^2(\hat{\beta}_0) = s^2\left[\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}\right] = s^2\delta_0, \tag{3.78}$$

and

$$\hat{\sigma}^2(\hat{\beta}_1) = \frac{s^2}{S_{xx}} = s^2\delta_1, \tag{3.79}$$

where $\delta_0$ and $\delta_1$ are referred to as the variance multiplication factors (VMF). The
standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$ are then estimated by

$$\hat{\sigma}(\hat{\beta}_0) = s\sqrt{\delta_0}, \tag{3.80}$$

and

$$\hat{\sigma}(\hat{\beta}_1) = s\sqrt{\delta_1}. \tag{3.81}$$

$\hat{\sigma}(\hat{\beta}_0)$ and $\hat{\sigma}(\hat{\beta}_1)$ are called the standard errors of the estimates $(\hat{\beta}_0, \hat{\beta}_1)$. As we shall
see, these standard errors play a fundamental role in obtaining confidence intervals for
$\beta_0$ and $\beta_1$ and in developing hypothesis tests for these quantities.

Example 3.9 Find the standard errors of $(\hat{\beta}_0, \hat{\beta}_1)$ in Example 3.2.

Here the calculations are simple enough to be done by hand. We find from the
calculations in that example that $\bar{x} = 3$ and $S_{xx} = 10$, so that with $s = 0.6055$,

$$\delta_0 = 1/5 + 9/10 = 11/10 \quad\text{and}\quad \delta_1 = 1/10.$$

Thus,

$$\hat{\sigma}(\hat{\beta}_0) = 0.6055\sqrt{11/10} = 0.6350,$$

$$\hat{\sigma}(\hat{\beta}_1) = 0.6055\sqrt{1/10} = 0.1915.$$


Example 3.10 Find the standard errors of the coefficients in the drink delivery
model in Example 3.4.

Here the calculations are too tedious to do by hand, so we read the values directly
from the computer output. They are

$$\hat{\sigma}(\hat{\beta}_0) = 1.371 \quad\text{and}\quad \hat{\sigma}(\hat{\beta}_1) = 0.124.$$

Notice in this case that the standard error for $\hat{\beta}_1$ is quite small in comparison to that
for $\hat{\beta}_0$, which is 1.371. This suggests that the "true" value of $\beta_1$ is probably quite close
to its estimated value $\hat{\beta}_1$. The larger standard error for $\hat{\beta}_0$ indicates that we should have
less confidence in the value of $\hat{\beta}_0$ as an estimate of $\beta_0$. These ideas will be formalized in
Section 3.6.
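
The standard error formulas (3.78)-(3.81) translate directly into a few lines of code. The sketch below is illustrative only; the function name `standard_errors` is ours, and the inputs are the summary values quoted in Examples 3.7 and 3.9 ($n = 5$, $\bar{x} = 3$, $S_{xx} = 10$, $SSE = 1.1$).

```python
import math

def standard_errors(n, xbar, sxx, s):
    """Return (se_b0, se_b1) from (3.80) and (3.81)."""
    delta0 = 1.0 / n + xbar ** 2 / sxx   # variance multiplication factor for the intercept
    delta1 = 1.0 / sxx                   # variance multiplication factor for the slope
    return s * math.sqrt(delta0), s * math.sqrt(delta1)

# Summary values quoted in Examples 3.7 and 3.9
s = math.sqrt(1.1 / 3)
se_b0, se_b1 = standard_errors(n=5, xbar=3.0, sxx=10.0, s=s)
print(round(se_b0, 4), round(se_b1, 4))
```

Because $s$ is carried at full precision here, the intercept standard error may differ from the hand calculation in Example 3.9 in the last printed digit.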


3.5 The Gauss-Markov Theorem 


As we have seen, when the errors in the model (3.1) are normal, maximum likelihood
estimation provides a justification for the use of the least squares estimators of $(\beta_0, \beta_1)$. However,
when the errors are not normal, then as shown in Section 3.2 this is generally not the
case, so it is important to ask what statistical justification there is for using least squares
estimation when the errors are not normal. In such circumstances, it can be shown
that the least squares estimators are the minimum variance unbiased linear estimators
of $(\beta_0, \beta_1)$. This optimality property, the Gauss-Markov theorem, is often invoked to
justify least squares estimation, even for non-normal errors. On the other hand, we
hasten to point out that there may very well be nonlinear and/or biased estimators of
$(\beta_0, \beta_1)$ which are more efficient than least squares for error distributions which depart
markedly from normal. This is a currently active research area, often called robust
regression analysis, and some aspects of this subject can be found in [27, 65].

We now proceed to a statement and proof of the Gauss-Markov theorem for simple
linear regression. The generalization to multiple regression will be taken up in Chapter
5.


Definition 3.1 Consider the model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$. A linear estimator $\tilde{\beta}_j$,
$j = 0, 1$, for $\beta_j$ is one of the form

$$\tilde{\beta}_j = \sum_{k=1}^{n} c_{jk}Y_k, \quad j = 0, 1, \tag{3.82}$$

where $c_{jk}$, $1 \le k \le n$, are given constants.
Note: Observe from (3.18) and (3.19) that the least squares estimators of $(\beta_0, \beta_1)$ are
linear.


Theorem 3.3 Consider the model

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad 1 \le i \le n, \tag{3.83}$$

where $\varepsilon_i$, $1 \le i \le n$, are uncorrelated and have constant variance $\operatorname{Var}(\varepsilon_i) = \sigma^2$, $1 \le i \le n$.
Then the minimum variance unbiased linear estimator of $\beta_j$, $j = 0, 1$, is the least squares
estimator.


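Proof. We will consider the proof for $\beta_1$, since the proof for $\beta_0$ is similar. Now
suppose that

$$\tilde{\beta}_1 = \sum_{i=1}^{n} c_iY_i. \tag{3.84}$$

Then, since $\varepsilon_i$, $1 \le i \le n$, are uncorrelated and have the same variance,

$$\operatorname{Var}(\tilde{\beta}_1) = \sigma^2\sum_{i=1}^{n} c_i^2, \tag{3.85}$$

and for $\tilde{\beta}_1$ to be unbiased we must have $E(\tilde{\beta}_1) = \beta_1$. This gives

$$E(\tilde{\beta}_1) = \sum_{i=1}^{n} c_iE(Y_i) = \sum_{i=1}^{n} c_i(\beta_0 + \beta_1 x_i) = \beta_1. \tag{3.86}$$

Since (3.86) must be true for all $\beta_0$ and $\beta_1$, we have, equating the coefficients of $\beta_0$
and $\beta_1$ in (3.86),

$$\sum_{i=1}^{n} c_i = 0 \quad\text{and}\quad \sum_{i=1}^{n} c_ix_i = 1. \tag{3.87}$$

Thus our problem reduces to minimizing $\sum_{i=1}^{n} c_i^2$ subject to (3.87). The solution to this
problem may be found by using the technique of Lagrange multipliers [104, 27]. That is,
we choose $(c_i, 1 \le i \le n, \lambda_1, \lambda_2)$ to be critical points of

$$Q = \sum_{i=1}^{n} c_i^2 - 2\lambda_1\sum_{i=1}^{n} c_i - 2\lambda_2\left(\sum_{i=1}^{n} c_ix_i - 1\right). \tag{3.88}$$

From calculus this can be done by solving the equations

$$\frac{\partial Q}{\partial c_i} = 0, \quad 1 \le i \le n, \qquad \frac{\partial Q}{\partial \lambda_1} = \frac{\partial Q}{\partial \lambda_2} = 0. \tag{3.89}$$

Carrying out the differentiations in (3.89) yields

$$2c_i - 2\lambda_1 - 2\lambda_2x_i = 0, \quad 1 \le i \le n, \tag{3.90}$$

$$\sum_{i=1}^{n} c_i = 0, \qquad \sum_{i=1}^{n} c_ix_i = 1. \tag{3.91}$$

To solve (3.90)-(3.91), add up the equations in (3.90) to give

$$\sum_{i=1}^{n} c_i - n\lambda_1 - \lambda_2\sum_{i=1}^{n} x_i = 0. \tag{3.92}$$

Using $\sum_{i=1}^{n} c_i = 0$, this gives

$$n\lambda_1 + \lambda_2\sum_{i=1}^{n} x_i = 0. \tag{3.93}$$

To get another equation involving $(\lambda_1, \lambda_2)$, multiply each equation in (3.90) by $x_i$ and
sum again. We get

$$\sum_{i=1}^{n} c_ix_i - \lambda_1\sum_{i=1}^{n} x_i - \lambda_2\sum_{i=1}^{n} x_i^2 = 0. \tag{3.94}$$

Since $\sum_{i=1}^{n} c_ix_i = 1$, (3.94) becomes

$$\lambda_1\sum_{i=1}^{n} x_i + \lambda_2\sum_{i=1}^{n} x_i^2 = 1. \tag{3.95}$$

Solving (3.93) and (3.95) for $(\lambda_1, \lambda_2)$ by using Cramer's rule [see Chapter 4] gives

$$\lambda_1 = \det\begin{pmatrix} 0 & \sum x_i \\ 1 & \sum x_i^2 \end{pmatrix} \Big/ \det\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}
= \frac{-\sum x_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} \tag{3.96}$$

and

$$\lambda_2 = \det\begin{pmatrix} n & 0 \\ \sum x_i & 1 \end{pmatrix} \Big/ \det\begin{pmatrix} n & \sum x_i \\ \sum x_i & \sum x_i^2 \end{pmatrix}
= \frac{n}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \tag{3.97}$$

where det denotes the determinant.
Dividing the numerator and denominator in (3.96)-(3.97) by $n$ and using the fact that
$\sum x_i^2 - n\bar{x}^2 = S_{xx}$ gives

$$\lambda_1 = -\frac{\bar{x}}{S_{xx}}, \quad \lambda_2 = \frac{1}{S_{xx}} \quad\text{and}\quad c_i = \frac{x_i - \bar{x}}{S_{xx}}, \tag{3.98}$$

so that

$$\tilde{\beta}_1 = \frac{1}{S_{xx}}\sum_{i=1}^{n}(x_i - \bar{x})Y_i, \tag{3.99}$$

which agrees with the expression (3.18) for the least squares estimator. $\blacksquare$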
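
The conclusion of the proof can be illustrated numerically. In the sketch below (an illustration under assumed inputs, not part of the original argument) the least squares weights $c_i = (x_i - \bar{x})/S_{xx}$ are compared with a competing set of weights that also satisfies the unbiasedness constraints (3.87). The design points, the perturbation vector `d` and the function names are assumptions introduced for the example.

```python
def slope_weights(x):
    """Least squares weights c_i = (x_i - xbar) / Sxx from (3.98)."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    return [(xi - xbar) / sxx for xi in x]

def constraints(w, x):
    """Return (sum w_i, sum w_i x_i); unbiasedness requires (0, 1), see (3.87)."""
    return sum(w), sum(wi * xi for wi, xi in zip(w, x))

x = [1.0, 2.0, 3.0, 4.0, 5.0]    # illustrative design points (an assumption)
c = slope_weights(x)

# A competing linear unbiased estimator: c + 0.05 * d, where d is orthogonal to
# both the vector of ones and x, so the constraints (3.87) still hold.
d = [1.0, -2.0, 1.0, 0.0, 0.0]   # sum(d) = 0 and sum(d_i * x_i) = 0
alt = [ci + 0.05 * di for ci, di in zip(c, d)]

ls_sum_sq = sum(ci ** 2 for ci in c)      # proportional to Var of the least squares slope
alt_sum_sq = sum(ai ** 2 for ai in alt)   # proportional to Var of the competitor
print(constraints(c, x), constraints(alt, x))
print(ls_sum_sq, alt_sum_sq)
```

Both sets of weights give unbiased estimators, but the competitor has the larger sum of squared weights and hence, by (3.85), the larger variance.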


Note: We have really only established that $(\lambda_1, \lambda_2, c_i, 1 \le i \le n)$ is a critical point of
$\operatorname{Var}(\tilde{\beta}_1)$. That $c_i$, $1 \le i \le n$, actually provide a minimum will be established in Chapter
5.

3.6 Confidence Intervals for $(\beta_0, \beta_1)$

In most statistical situations it is desirable to have confidence intervals for unknown
parameters rather than just point estimates. These interval estimates, discussed in Chapter
2, provide, in a certain sense, error estimates for a given point estimator and, as we shall
see, they play a vital role in regression analysis.

As in problems in elementary statistics, knowledge of the distribution of the point
estimates is crucial in obtaining good confidence intervals, and to do this we will make
normality assumptions. However, in many circumstances these confidence intervals are
used even if the errors are not normal. This can often be justified because the least
squares estimators are linear in $Y_i$, $1 \le i \le n$, which enables one to appeal to the Central
Limit Theorem to obtain the approximate normality of $(\hat{\beta}_0, \hat{\beta}_1)$, even if the errors are
not normal.
When $\varepsilon_i$, $1 \le i \le n$, are independent and $N(0, \sigma^2)$, then as we stated after Theorem
3.2, $(n - 2)s^2/\sigma^2$ is $\chi^2(n - 2)$ and is independent of each of $(\hat{\beta}_0, \hat{\beta}_1)$. It then follows
that the random variables

$$T_i = \frac{\hat{\beta}_i - \beta_i}{\hat{\sigma}(\hat{\beta}_i)}, \quad i = 0, 1, \tag{3.100}$$

each have a t-distribution with $n - 2$ degrees of freedom. This is true since

$$T_i = \frac{(\hat{\beta}_i - \beta_i)/\sigma(\hat{\beta}_i)}{s/\sigma}, \quad i = 0, 1, \tag{3.101}$$

where $\sigma(\hat{\beta}_i) = \sigma\sqrt{\delta_i}$, $\hat{\sigma}(\hat{\beta}_i) = s\sqrt{\delta_i}$, and $\delta_i$ is given by either (3.78) or (3.79). Then
$(\hat{\beta}_i - \beta_i)/\sigma(\hat{\beta}_i)$ is $N(0, 1)$ and is independent of $s/\sigma$, which is the square root of a
$\chi^2$ random variable divided by its degrees of freedom. From Section 2.8 such a ratio
is well known to have a t-distribution.

If $t_{n-2,\alpha/2}$ is the upper $\alpha/2$ percentage point ($1 - \alpha/2$ percentile) of a t-distribution
with $n - 2$ degrees of freedom, then

$$P\left\{-t_{n-2,\alpha/2} \le T_i \le t_{n-2,\alpha/2}\right\} = 1 - \alpha, \quad i = 0, 1, \tag{3.102}$$

so that

$$P\left\{-t_{n-2,\alpha/2} \le \frac{\hat{\beta}_i - \beta_i}{\hat{\sigma}(\hat{\beta}_i)} \le t_{n-2,\alpha/2}\right\} = 1 - \alpha, \quad i = 0, 1. \tag{3.103}$$

Solving the inequality in (3.103) for $\beta_i$ shows that

$$P\left\{\hat{\beta}_i - t_{n-2,\alpha/2}\hat{\sigma}(\hat{\beta}_i) \le \beta_i \le \hat{\beta}_i + t_{n-2,\alpha/2}\hat{\sigma}(\hat{\beta}_i)\right\} = 1 - \alpha, \quad i = 0, 1. \tag{3.104}$$

According to the definition of a confidence interval, the pair of random variables

$$\left(\hat{\beta}_i - t_{n-2,\alpha/2}\hat{\sigma}(\hat{\beta}_i),\; \hat{\beta}_i + t_{n-2,\alpha/2}\hat{\sigma}(\hat{\beta}_i)\right), \quad i = 0, 1, \tag{3.105}$$

is a $(1 - \alpha) \times 100\%$ confidence interval for $\beta_i$, $i = 0, 1$.

For a given sample the value of the confidence interval is obtained by substituting
the values of the estimates $\hat{\beta}_0$, $\hat{\beta}_1$ and $s$ into (3.105). As is customary, we will make no
notational or verbal distinction between the confidence interval and its value. For the
convenience of the reader the specific formulas, obtained by substituting (3.78) and (3.79)
into (3.105), are given below:

(1) Confidence interval for $\beta_0$:

$$\hat{\beta}_0 \pm t_{n-2,\alpha/2} \cdot s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{S_{xx}}}. \tag{3.106}$$

(2) Confidence interval for $\beta_1$:

$$\hat{\beta}_1 \pm t_{n-2,\alpha/2} \cdot \frac{s}{\sqrt{S_{xx}}}. \tag{3.107}$$

One should observe from (3.106) and (3.107) that in most practical cases the
confidence interval for $\beta_0$ will be wider than that for $\beta_1$. That is, the slope is generally
estimated with greater precision than the intercept. This will be apparent in the following
numerical examples.


Example 3.11 Find 95% confidence intervals for $(\beta_0, \beta_1)$ in the drug and reaction
time data in Example 3.2.

From Table A.2 we find that $t_{3,0.025} = 3.182$, and from (3.106) and (3.107) 95%
confidence intervals are given by

$$\hat{\beta}_0 \pm 3.182 \times \hat{\sigma}(\hat{\beta}_0) = -0.1 \pm 3.182 \times 0.6350 = (-2.1206, 1.9206)$$

and

$$\hat{\beta}_1 \pm 3.182 \times \hat{\sigma}(\hat{\beta}_1) = 0.7 \pm 3.182 \times 0.1915 = (0.0906, 1.3094).$$

Example 3.12 Find 95% confidence intervals for $(\beta_0, \beta_1)$ in the soft drink delivery
model in Example 3.4.

Here $n = 25$, $n - 2 = 23$ and from Table A.2 $t_{23,0.025} = 2.069$, so that using the results
from the computer output 95% confidence intervals are given by

$$\beta_0: \; (3.321 \pm 2.069 \times 1.371) = (0.4844, 6.1576)$$

and

$$\beta_1: \; (2.176 \pm 2.069 \times 0.124) = (1.9190, 2.433).$$

Note that the rather narrow confidence interval for $\beta_1$ suggests that our point estimate
for the slope is probably quite reliable, while that for the intercept is much less so. This
is a numerical verification of the observations made above concerning the sizes of the
variance multiplication factors $\delta_0$ and $\delta_1$.
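
The interval arithmetic in Examples 3.11 and 3.12 follows one pattern, estimate plus or minus the tabulated t value times the standard error. A minimal sketch is given below; the function name `t_interval` is ours, and the t values are the tabulated ones quoted in the examples rather than values computed by the code.

```python
def t_interval(estimate, std_error, t_crit):
    """Two-sided interval estimate +/- t_crit * std_error, as in (3.105)."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width

# Example 3.11: n = 5, tabulated t_{3, 0.025} = 3.182
ci_b0_drug = t_interval(-0.1, 0.6350, 3.182)
ci_b1_drug = t_interval(0.7, 0.1915, 3.182)

# Example 3.12: n = 25, tabulated t_{23, 0.025} = 2.069
ci_b0_drink = t_interval(3.321, 1.371, 2.069)
ci_b1_drink = t_interval(2.176, 0.124, 2.069)

for name, ci in [("drug b0", ci_b0_drug), ("drug b1", ci_b1_drug),
                 ("drink b0", ci_b0_drink), ("drink b1", ci_b1_drink)]:
    print(name, round(ci[0], 4), round(ci[1], 4))
```

Since the inputs are rounded values read from the examples, the endpoints may differ from the printed intervals in the third or fourth decimal place.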

One can also obtain confidence intervals for $\sigma$. However, such confidence intervals
appear to be rarely used in practice. The details are left to the reader.

3.6.1 Simultaneous Confidence Intervals

Occasionally it is helpful to obtain joint confidence intervals (or more generally a joint
confidence region) for $(\beta_0, \beta_1)$, because (3.106) and (3.107) will not hold simultaneously
with the same level of confidence.

A number of procedures have been devised to solve this problem and we shall consider
two: rectangular regions obtained through the use of Bonferroni's inequality and exact
confidence ellipses. Other methods may be found in [101, 47]. We begin our discussion
with a formal definition of a joint confidence region for $p$ parameters.


Definition 3.2 Let $(\theta_1, \theta_2, \ldots, \theta_p)$ be $p$ unknown parameters of a given distribution.
A joint confidence region for $(\theta_1, \theta_2, \ldots, \theta_p)$ with confidence at least $(1 - \alpha) \times 100\%$ is a
random region $C$ in $R^p$ (the set of all real $p$-tuples), not depending on $(\theta_1, \theta_2, \ldots, \theta_p)$, such
that

$$P\left\{(\theta_1, \theta_2, \ldots, \theta_p) \in C\right\} \ge 1 - \alpha, \tag{3.108}$$

for all possible values of $(\theta_1, \theta_2, \ldots, \theta_p)$. If (3.108) holds with equality, then $C$ will be
called an exact $(1 - \alpha) \times 100\%$ joint confidence region for $(\theta_1, \theta_2, \ldots, \theta_p)$.

Note: If $p = 1$ and $C$ is the region determined by the pair of estimators $(\hat{\theta}_L, \hat{\theta}_U)$,
$\hat{\theta}_L \le \hat{\theta}_U$, then the confidence region determined by $(\hat{\theta}_L, \hat{\theta}_U)$ is just the usual notion of a
confidence interval for $\theta$.

One simple method for generating joint confidence regions is to take rectangular
regions of the form

$$C = \prod_{i=1}^{p}\left[\hat{\theta}_{L_i}, \hat{\theta}_{U_i}\right], \tag{3.109}$$

where $(\hat{\theta}_{L_i}, \hat{\theta}_{U_i})$ is a $(1 - \alpha/p) \times 100\%$ confidence interval for $\theta_i$.

The validity of (3.109) follows from a well known inequality in probability theory, the
first Bonferroni inequality. We will state and prove this inequality in Theorem 3.4 and
then use it to establish (3.109).


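Theorem 3.4 (Bonferroni's inequality) Let $E_1, E_2, \ldots, E_p$ be $p$ events in a probability
space $\Omega$ with $P\{E_i\} = 1 - \alpha_i$, $1 \le i \le p$. Then

$$P\left\{\bigcap_{i=1}^{p} E_i\right\} \ge 1 - \sum_{i=1}^{p}\alpha_i. \tag{3.110}$$

Proof. From de Morgan's law

$$\left(\bigcap_{i=1}^{p} E_i\right)^c = \bigcup_{i=1}^{p} E_i^c. \tag{3.111}$$

Thus,

$$P\left\{\bigcup_{i=1}^{p} E_i^c\right\} = 1 - P\left\{\bigcap_{i=1}^{p} E_i\right\}. \tag{3.112}$$

Now it is well known that

$$P\left\{\bigcup_{i=1}^{p} E_i^c\right\} \le \sum_{i=1}^{p} P(E_i^c) = \sum_{i=1}^{p}\left[1 - P(E_i)\right] = \sum_{i=1}^{p}\alpha_i, \tag{3.113}$$

so that

$$P\left\{\bigcap_{i=1}^{p} E_i\right\} \ge 1 - \sum_{i=1}^{p}\alpha_i, \tag{3.115}$$

as required. $\blacksquare$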


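We now use (3.110) to establish (3.109). For this let $(\hat{\theta}_{L_i}, \hat{\theta}_{U_i})$ be a $(1 - \alpha/p) \times 100\%$
confidence interval for $\theta_i$, $1 \le i \le p$. Then

$$P\left\{\hat{\theta}_{L_i} \le \theta_i \le \hat{\theta}_{U_i}\right\} = 1 - \alpha/p. \tag{3.116}$$

Let $E_i = \left\{\hat{\theta}_{L_i} \le \theta_i \le \hat{\theta}_{U_i}\right\}$, so that

$$\alpha_i = 1 - P\{E_i\} = \alpha/p. \tag{3.117}$$

From (3.116) and (3.117), together with Bonferroni's inequality (3.110),

$$P\left\{\bigcap_{i=1}^{p} E_i\right\} \ge 1 - p \cdot \frac{\alpha}{p} = 1 - \alpha, \tag{3.118}$$

so that $C$ is a joint confidence region with at least $(1 - \alpha) \times 100\%$ confidence, since

$$\bigcap_{i=1}^{p} E_i = \left\{(\theta_1, \theta_2, \ldots, \theta_p) \in C\right\}. \tag{3.119}$$

Specializing this argument to the case of the simple linear regression model, we know
that

$$\left\{\hat{\beta}_i \pm t_{n-2,\alpha/4}\hat{\sigma}(\hat{\beta}_i)\right\}, \quad i = 0, 1, \tag{3.120}$$

is a $(1 - \alpha/2) \times 100\%$ confidence interval for $\beta_i$, $i = 0, 1$. Thus the rectangle determined by

$$\left\{\hat{\beta}_0 \pm t_{n-2,\alpha/4}\hat{\sigma}(\hat{\beta}_0)\right\} \times \left\{\hat{\beta}_1 \pm t_{n-2,\alpha/4}\hat{\sigma}(\hat{\beta}_1)\right\} \tag{3.121}$$

is a joint confidence region for $(\beta_0, \beta_1)$ with at least $(1 - \alpha) \times 100\%$ confidence.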
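
The bookkeeping behind the Bonferroni rectangle is easy to get wrong, because the subscript of the t percentage point is the upper-tail area $\alpha/(2p)$ rather than $\alpha/p$. The short sketch below, with a function name chosen here for illustration, tabulates the three quantities involved for $p$ two-sided intervals.

```python
def bonferroni_levels(alpha, p):
    """Per-interval confidence, upper-tail area, and guaranteed joint coverage."""
    per_interval_conf = 1.0 - alpha / p          # each interval is a (1 - alpha/p) interval
    upper_tail = alpha / (2.0 * p)               # subscript of the t percentage point
    joint_lower_bound = 1.0 - p * (alpha / p)    # lower bound from (3.118)
    return per_interval_conf, upper_tail, joint_lower_bound

# Simple linear regression has p = 2 parameters (intercept and slope)
conf, tail, joint = bonferroni_levels(alpha=0.05, p=2)
print(conf, tail, joint)
```

For $p = 2$ and $\alpha = 0.05$ the upper-tail area is $\alpha/4 = 0.0125$, which is why $t_{n-2,\alpha/4}$ appears in (3.120) and (3.121).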

If one uses more sophisticated arguments, then it is possible to obtain "smaller"
confidence regions than those determined by (3.121). For $(\beta_0, \beta_1)$ these regions are
ellipses rather than rectangles.

Theorem 3.5 For the simple linear regression model with independent $N(0, \sigma^2)$
errors, a joint $(1 - \alpha) \times 100\%$ confidence region for $(\beta_0, \beta_1)$ is given by the set

$$n(\hat{\beta}_0 - \beta_0)^2 + 2\sum_{i=1}^{n} x_i(\hat{\beta}_0 - \beta_0)(\hat{\beta}_1 - \beta_1) + \sum_{i=1}^{n} x_i^2(\hat{\beta}_1 - \beta_1)^2 \le 2s^2 f_{\alpha,2,n-2}, \tag{3.122}$$

where $f_{\alpha,2,n-2}$ satisfies $P\{F_{2,n-2} > f_{\alpha,2,n-2}\} = \alpha$ and $F_{2,n-2}$ is an $F$ random variable with $(2, n - 2)$
degrees of freedom.

Proof. We will also obtain this as a particular case of a more general result for
multiple regression models in Chapter 5. $\blacksquare$


3.7 Hypothesis Tests for $(\beta_0, \beta_1)$

So far we have assumed that the model that relates $y$ to $x$ is of the form in (3.1) and all
subsequent results have been based on this assumption. We now turn our attention to
assessing the validity of these assumptions.

Assuming for the time being that the model is linear and that $x$ is the only possible
explanatory variable, we consider the question of determining whether $x$ is helpful in
explaining the variation in $y$. That is, we wish to distinguish between the models

$$Y_i = \beta_0 + \varepsilon_i \tag{3.123}$$

and

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i. \tag{3.124}$$

In parametric terms we want to test

$$H_0: \beta_1 = 0 \quad\text{against}\quad H_1: \beta_1 \ne 0. \tag{3.125}$$

In accepting $H_0$, we can conclude that $x$ is of no use in explaining the variation of $y$,
whereas accepting $H_1$ suggests that it is. However, keep in mind that one can only come
to these conclusions if one truly knows that the model is linear. In practical situations
one usually does not know this for certain, and a more cautious interpretation is called
for. In particular, accepting $H_0$ may not mean that $x$ is not a useful explanatory variable.
It may only indicate that the variation of $y$ with $x$ does not contain a linear component.
On the other hand, accepting $H_1$ suggests that there is a linear trend of $y$ with $x$, but
other types of variation may be present as well.

To develop a test of $H_0$ against $H_1$ we use the confidence intervals developed for $\beta_1$
in the previous section. These confidence intervals, depending on the confidence level,
may be viewed as giving a plausible range of values for the true slope $\beta_1$. If zero is in
one of these intervals, then at the appropriate level of confidence zero is a plausible value
of $\beta_1$ and we accept $H_0$. If zero is not in the interval, then zero is not a plausible value
of $\beta_1$ and we reject $H_0$. Thus we reject $H_0$ if

$$\hat{\beta}_1 + t_{n-2,\alpha/2}\hat{\sigma}(\hat{\beta}_1) < 0 \quad\text{or}\quad \hat{\beta}_1 - t_{n-2,\alpha/2}\hat{\sigma}(\hat{\beta}_1) > 0. \tag{3.126}$$

Rearranging, we find that the critical region for rejecting $H_0$ is given by

$$\frac{\hat{\beta}_1}{\hat{\sigma}(\hat{\beta}_1)} < -t_{n-2,\alpha/2} \quad\text{or}\quad \frac{\hat{\beta}_1}{\hat{\sigma}(\hat{\beta}_1)} > t_{n-2,\alpha/2}, \tag{3.127}$$

which is equivalent to

$$\left|\frac{\hat{\beta}_1}{\hat{\sigma}(\hat{\beta}_1)}\right| > t_{n-2,\alpha/2}, \tag{3.128}$$

where $\hat{\sigma}(\hat{\beta}_1) = s/\sqrt{S_{xx}}$. The quantity $\hat{\beta}_1/\hat{\sigma}(\hat{\beta}_1)$ is referred to as the observed $t$ value,
so our test is to reject $H_0$ if $|t\,(\text{observed})| > t\,(\text{tabulated})$. From our discussion in Section
2.13 we see that this test has significance level $\alpha$.


Example 3.13 For the soft drink delivery data in Example 3.4, can we reject $H_0$ at
the 5% level of significance?

From the computer output we find that the observed $t$ value for $\hat{\beta}_1$ is 17.55. Since
$t_{23,0.025} = 2.069$ we can reject $H_0$ at the 5% level of significance. In fact, the probability
that $|t| > 17.55$ if $H_0$ is true is less than 0.0001, so that we can accept $H_1$ with at least
99.99% confidence.
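
The observed $t$ value in Example 3.13 can be recovered from the slope estimate and standard error quoted earlier for the soft drink delivery data. The sketch below is illustrative; the function names are ours, the optional `hypothesized` argument is an assumption added so the same function covers tests of a nonzero value, and the critical value is the tabulated one from the example.

```python
def t_observed(estimate, std_error, hypothesized=0.0):
    """Observed t value for H0: beta = hypothesized; (3.128) is the case hypothesized = 0."""
    return (estimate - hypothesized) / std_error

def reject_h0(t_obs, t_crit):
    """Two-sided test: reject when |t observed| exceeds the tabulated value."""
    return abs(t_obs) > t_crit

# Slope estimate 2.176 and standard error 0.124 from Examples 3.10 and 3.12
t_obs = t_observed(2.176, 0.124)
decision = reject_h0(t_obs, t_crit=2.069)
print(round(t_obs, 2), decision)
```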


An intuitive interpretation of the t-test for $\beta_1$ can be given in the following way. Note
that $\hat{\sigma}(\hat{\beta}_1)$ estimates the error in estimating $\beta_1$, so that $t = \left|\hat{\beta}_1/\hat{\sigma}(\hat{\beta}_1)\right|$ represents the
reciprocal of a "relative" error in estimating $\beta_1$. If $\beta_1$ is well estimated, then we expect
$\hat{\sigma}(\hat{\beta}_1)$ to be "small" in relation to $\hat{\beta}_1$; hence $t$ is "large". Thus, the t-test says that we
should accept $\beta_1 \ne 0$ provided that $\beta_1$ is well estimated relative to its error.

Most modern regression programs print out the p-value $P\{|T| > |t\,(\text{observed})|\}$, where
$T$ is a $t$ random variable with $n - 2$ degrees of freedom. If this information is not
available, a quick "eyeball" test can be given. Note from Table A.3 that $t_{25,0.025} = 2.060$
and this value does not change significantly as the degrees of freedom increase. This
is a consequence of the fact, observed in Chapter 2, that the limit of a $t$ density as the
number of degrees of freedom increases is a $N(0, 1)$ density with $z_{0.025} = 1.96$.

This leads to a simple rule for rejecting $H_0$ for a sample size $n > 20$: reject $H_0$ if
$|t\,(\text{observed})| > 2$. This rule will be extended to multiple regression in Chapter 5.

In situations where one has a pre-conceived notion as to the true value $b_1$ of $\beta_1$, the
t-test for (3.125) can be extended to test

$$H_0: \beta_1 = b_1 \quad\text{against}\quad H_1: \beta_1 \ne b_1 \tag{3.129}$$

by forming the $t$ statistic

$$t\,(\text{observed}) = \frac{\hat{\beta}_1 - b_1}{\hat{\sigma}(\hat{\beta}_1 - b_1)} \tag{3.130}$$

and rejecting $H_0$ if $|t\,(\text{observed})| > t_{n-2,\alpha/2}$ for a test with significance level $\alpha$. Since
$\hat{\sigma}(\hat{\beta}_1 - b_1) = \hat{\sigma}(\hat{\beta}_1)$, because $\operatorname{Var}(\hat{\beta}_1 - b_1) = \operatorname{Var}(\hat{\beta}_1)$,

$$t\,(\text{observed}) = \frac{\hat{\beta}_1 - b_1}{\hat{\sigma}(\hat{\beta}_1)}. \tag{3.131}$$

As for (3.125), this test can be derived by a confidence interval argument. We leave
the details to the reader.

As well as testing for the value of the slope of $y = \beta_0 + \beta_1 x + \varepsilon$, one commonly tests
for the significance of the intercept. Here, the situation of most interest is determining
whether the true line passes through the origin $(0, 0)$. This leads to the test

$$H_0: \beta_0 = 0 \quad\text{against}\quad H_1: \beta_0 \ne 0. \tag{3.132}$$

Accepting $H_0$ suggests that the simpler model $y = \beta_1 x + \varepsilon$ explains the data better than
$y = \beta_0 + \beta_1 x + \varepsilon$. If $H_0$ is true, the data may be refitted by this simpler model. We take
this matter up in greater detail in Section 3.8.

As for $\beta_1$, we can test (3.132) by using the $t$ statistic

$$t\,(\text{observed}) = \frac{\hat{\beta}_0}{\hat{\sigma}(\hat{\beta}_0)} \tag{3.133}$$

and reject $H_0$ at level $\alpha$ if $|t\,(\text{observed})| > t_{n-2,\alpha/2}$. As for $\beta_1$, an "eyeball" test is to
reject $H_0$ provided that

$$|t\,(\text{observed})| > 2. \tag{3.134}$$

Again, this test can be derived by using a confidence interval argument. The details
are left to the reader.

One-sided tests for the regression coefficients can be obtained by using one-sided
confidence intervals as indicated in Section 2.13.


3.8 The ANOVA Approach to Testing 


In the previous section we have seen how a t-test could be used to test for the significance 
of the regression. Here we will develop an equivalent F-test, which, as we shall see, can be 
generalized to test for the overall significance of the regression in the multiple regression 
case as well. This will be useful since the corresponding tests for the significance of the 
individual coefficients may be misleading. 

The idea behind this method, called the analysis of variance (ANOVA) approach to
testing, is to determine how much of the variability in the observations $(y_1, y_2, \ldots, y_n)$ is
"explained" by the regression line. For this purpose a natural measure of variability in
the data is the total sum of squares,

$$SST = \sum_{i=1}^{n}(y_i - \bar{y})^2, \tag{3.135}$$

which is $n - 1$ times the sample variance of $(y_1, y_2, \ldots, y_n)$. If the regression line
$y = \hat{\beta}_0 + \hat{\beta}_1 x$ fits the data well, then $\hat{y}_i \approx y_i$, so that

$$\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2 \approx \sum_{i=1}^{n}(y_i - \bar{y})^2. \tag{3.136}$$

Since $\bar{\hat{y}} = \bar{y}$ if $\beta_0 \ne 0$ (this will be shown in Theorem 3.8), then

$$\sum_{i=1}^{n}(\hat{y}_i - \bar{\hat{y}})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2, \tag{3.137}$$

so that the ratio

$$R^2 = \frac{SSR}{SST} = \frac{\sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2}{\sum_{i=1}^{n}(y_i - \bar{y})^2} \tag{3.138}$$

measures the fraction of the variability in $(y_1, y_2, \ldots, y_n)$ that can be accounted for by
the regression line. For a good fit we would like this quantity, usually called the coefficient
of determination, to be close to one. Since $R^2$ can also be shown to be the square of the
sample correlation coefficient between $(y_1, y_2, \ldots, y_n)$ and $(x_1, x_2, \ldots, x_n)$, this lends further
credibility to $R^2$ as a measure of "goodness of fit."

Because $R^2$ is a function of the sample values $(y_1, y_2, \ldots, y_n)$, it is a random variable,
so if we know its distribution it would seem plausible to test for the significance of the
regression by rejecting $H_0: \beta_1 = 0$ if $R^2$ is sufficiently close to one. Since any monotone
function of $R^2$ gives an equivalent test, it is convenient to consider

$$F = \frac{(n - 2)R^2}{1 - R^2}. \tag{3.139}$$

The reason for this rather arbitrary choice will become apparent after we develop
some of the basic properties of $R^2$ and $F$. These are given in Theorem 3.6.


Theorem 3.6 Assume that $SST \ne 0$. Then

(i) $0 \le R^2 \le 1$;

(ii) $R^2 = 1 - SSE/SST$;

(iii) $R^2 = 1$ if and only if $\hat{y}_i = y_i$, $1 \le i \le n$ (i.e., the data points all lie on the
regression line).

(iv) If $x = (x_1, x_2, \ldots, x_n)$ and $y = (y_1, y_2, \ldots, y_n)$, then $R^2 = \rho^2(x, y)$,
where

$$\rho^2(x, y) = \frac{\left[\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\right]^2}{S_{xx}S_{yy}} \tag{3.140}$$

is the square of the sample correlation coefficient between $(x, y)$.

(v) $F = SSR/s^2 = T^2$.

(vi) If the errors $\varepsilon_i$, $1 \le i \le n$, are independent $N(0, \sigma^2)$ random variables and $\beta_1 = 0$
in (3.1), then $F$ has an F-distribution with $(1, n - 2)$ degrees of freedom.


Note that it is the relation in (v) that reveals our interest in $R^2$ and $F$. From (v)
and (vi) we see that using any of $T$, $R^2$ or $F$ provides us with equivalent statistics for
testing the significance of the regression.

We also point out that the size of $R^2$ is almost always examined when doing a least
squares fit. It provides a rough "eyeball" goodness of fit test; i.e., the closer $R^2$ is to
one, the "better" the regression line fits the data. (However, some caution is needed in
this respect and we will return to this point later.) The $F$ ratio may then be viewed as
a statistic for testing the significance of large values of $R^2$.

Before proving Theorem 3.6 we point out that many of the details depend on
establishing the following decomposition of $SST$:

$$\sum_{i=1}^{n}(y_i - \bar{y})^2 = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n}(y_i - \hat{y}_i)^2. \tag{3.141}$$

Equation (3.141) is usually written in the form

$$SST = SSR + SSE, \tag{3.142}$$

where $SST$ is called the total sum of squares (also the total adjusted-for-the-mean sum
of squares), $SSR$ is the regression sum of squares and, of course, $SSE$ is the residual sum
of squares.

To establish (3.141) we need a lemma concerning the residuals from the least squares
fit.


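Lemma 3.1 Let $\hat{e}_i = y_i - \hat{y}_i$, $1 \le i \le n$, denote the residuals from the least squares
fit $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, $1 \le i \le n$, to $(y_1, y_2, \ldots, y_n)$. Then,

(i) $\sum_{i=1}^{n}\hat{e}_i = 0$;

(ii) $\bar{\hat{y}} = \bar{y}$;

(iii) $\sum_{i=1}^{n}\hat{y}_i\hat{e}_i = 0$.

Proof. (i) From the first normal equation $\partial S/\partial\beta_0 = 0$, where $S$ is given by (3.9),
we get

$$\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right) = \sum_{i=1}^{n}(y_i - \hat{y}_i) = \sum_{i=1}^{n}\hat{e}_i = 0. \tag{3.143}$$

(ii) From (i), $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n}\hat{y}_i$, so dividing both of these sums by $n$ gives (ii).

(iii) From the second normal equation $\partial S/\partial\beta_1 = 0$ we get

$$\sum_{i=1}^{n}\left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)x_i = \sum_{i=1}^{n}\hat{e}_ix_i = 0. \tag{3.144}$$

Thus,

$$\sum_{i=1}^{n}\hat{y}_i\hat{e}_i = \sum_{i=1}^{n}\left(\hat{\beta}_0 + \hat{\beta}_1 x_i\right)\hat{e}_i
= \hat{\beta}_0\sum_{i=1}^{n}\hat{e}_i + \hat{\beta}_1\sum_{i=1}^{n}x_i\hat{e}_i = 0. \tag{3.145}$$

$\blacksquare$

Note that the fact that $\sum_{i=1}^{n}\hat{e}_i = 0$ provides a convenient check on the accuracy
of least squares calculations. A rough "rule of thumb" is that one can expect that the
number of valid significant figures to the right of the decimal point in $\hat{\beta}_0$, $\hat{\beta}_1$, etc.
is about the same as the number of zeros to the right of the decimal point in the sum of
the residuals.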
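
The three identities of Lemma 3.1 are easy to confirm on a small data set, which is also how the accuracy check described above would be carried out in practice. The sketch below is illustrative: it reuses the five observations of Example 3.7 with assumed design points $x = 1, \ldots, 5$ (inferred from the fitted values, not stated in this section), and the helper name `fit_line` is ours.

```python
def fit_line(x, y):
    """Least squares intercept and slope for a simple linear regression."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

x = [1, 2, 3, 4, 5]   # assumed design points
y = [1, 1, 2, 2, 4]
b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

sum_resid = sum(resid)                                        # Lemma 3.1 (i)
mean_gap = sum(fitted) / len(x) - sum(y) / len(y)             # Lemma 3.1 (ii)
sum_fitted_resid = sum(f * e for f, e in zip(fitted, resid))  # Lemma 3.1 (iii)
print(sum_resid, mean_gap, sum_fitted_resid)
```

All three printed values should be zero up to floating-point rounding.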
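We now use Lemma 3.1 to prove Equation (3.142). For this we write

$$\begin{aligned}
\sum_{i=1}^{n}(y_i - \bar{y})^2 &= \sum_{i=1}^{n}\left[(y_i - \hat{y}_i) + (\hat{y}_i - \bar{y})\right]^2 \\
&= \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + 2\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) + \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 \\
&= SSE + SSR + 2\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}).
\end{aligned} \tag{3.146}$$

Now from Lemma 3.1 we get

$$\sum_{i=1}^{n}(y_i - \hat{y}_i)(\hat{y}_i - \bar{y}) = \sum_{i=1}^{n}\hat{e}_i\hat{y}_i - \bar{y}\sum_{i=1}^{n}\hat{e}_i = 0 + 0 = 0. \tag{3.147}$$

Thus,

$$SST = SSR + SSE.$$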


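Proof of Theorem 3.6. (i) From (3.142),

$$0 \le R^2 = \frac{SSR}{SST} \le \frac{SST}{SST} = 1. \tag{3.148}$$

(ii) From (3.142), $SSR = SST - SSE$, so that

$$R^2 = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}. \tag{3.149}$$

(iii) From (ii), $R^2 = 1$ if and only if $SSE = 0$. Since $SSE = \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = 0$,
then $y_i = \hat{y}_i$, $i = 1, 2, \ldots, n$.

(iv) From the definition of $\hat{y}_i$ we get

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i = \bar{y} - \hat{\beta}_1\bar{x} + \hat{\beta}_1 x_i = \bar{y} + \hat{\beta}_1(x_i - \bar{x}), \tag{3.150}$$

so that

$$SSR = \sum_{i=1}^{n}(\hat{y}_i - \bar{y})^2 = \hat{\beta}_1^2\sum_{i=1}^{n}(x_i - \bar{x})^2 = \hat{\beta}_1^2S_{xx}. \tag{3.151}$$

Using the definition of the sample correlation coefficient,

$$\rho^2(x, y) = \frac{\left[\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})\right]^2}{S_{xx}S_{yy}}, \tag{3.152}$$

and the formula

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{S_{xx}} \tag{3.153}$$

gives

$$\rho^2(x, y) = \frac{\hat{\beta}_1^2S_{xx}}{S_{yy}} = \frac{SSR}{SST} = R^2, \tag{3.154}$$

since $S_{yy} = \sum_{i=1}^{n}(y_i - \bar{y})^2 = SST$.

(v) From the definition of $F$,

$$F = \frac{(n - 2)R^2}{1 - R^2} = \frac{(n - 2)(SSR/SST)}{SSE/SST} = \frac{SSR}{SSE/(n - 2)} = \frac{SSR}{s^2}. \tag{3.155}$$

Also from (3.128),

$$T^2 = \frac{\hat{\beta}_1^2}{\hat{\sigma}^2(\hat{\beta}_1)} = \frac{\hat{\beta}_1^2S_{xx}}{s^2} = \frac{SSR}{s^2} = F. \tag{3.156}$$

(vi) If $\{\varepsilon_i\}_{i=1}^{n}$ are independent $N(0, \sigma^2)$, then, as was stated in Section 3.6, $T$ has a
t-distribution with $n - 2$ degrees of freedom. Since $F = T^2$, it follows that $F$ has an
F-distribution with $(1, n - 2)$ degrees of freedom. $\blacksquare$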


It is common practice to summarize many of the results of Theorem 3.6 in a tabular 
format like that in Table 3.8. This table is usually referred to as the ANOVA (short for 
analysis of variance) table for the regression analysis and is standard output from all of 
the commonly available regression packages. 


Table 3.8 Analysis of Variance (ANOVA) Table


Source       df      SS     MS                      F
Regression   1       SSR    MSR = SSR/1             MSR/MSE
Residual     n - 2   SSE    MSE = SSE/(n - 2) = s^2
Total        n - 1   SST

R^2 = SSR/SST


The meaning of the ANOVA Table headings is as follows:
Source: source of the sum of squares.
df: the number of degrees of freedom associated with the corresponding sum of squares.
SS: Abbreviation for "sum of squares."
MS: Abbreviation for "mean squares," defined by SS/df.
F: The F-ratio, defined by MSR/MSE = MSR/$s^2$.
$R^2$: SSR/SST. (Note that some authors express $R^2$ as a percentage: the percentage
of variation in the observations explained by the regression line.)


Classically, the df associated with each sum of squares (SS) derives from the fact that
under the standard normality assumptions about the errors, $SSR/\sigma^2$ is $\chi^2(1)$, $SSE/\sigma^2$
is $\chi^2(n-2)$ and $SST/\sigma^2$ is $\chi^2(n-1)$ when the null hypothesis $\beta_1 = 0$ is true. The
degrees of freedom are then the degrees of freedom associated with the corresponding
$\chi^2$ random variables. However, even if the errors are not normal, it is still customary
to associate the $\chi^2$ degrees of freedom in the normal case with the corresponding SS in
the non-normal case as well. For example, one can think that in SSE the n residuals
are free to vary subject only to the conditions imposed by the least squares equations:
$\sum_i \hat{e}_i = 0$ and $\sum_i x_i \hat{e}_i = 0$. Then there are n - 2 degrees of freedom associated with
SSE. Similarly, the $y_i - \bar{y}$ are constrained by the fact that $\sum_{i=1}^{n} (y_i - \bar{y}) = 0$, so SST has
n - 1 d.f. Since SSR = SST - SSE, the degrees of freedom of SSR is (n - 1) - (n - 2) = 1.
Further extensions of this table for simple linear regression will be given in Section
3.9 and for multiple regression in Chapter 5.
Before giving several numerical examples of ANOVA tables we prove the distribution
properties of $SSR/\sigma^2$, $SSE/\sigma^2$ and $SST/\sigma^2$ mentioned above.


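Theorem 3.7 If the errors $\varepsilon_i$, $1 \le i \le n$, in (3.1) are independent $N(0, \sigma^2)$, then if
$H_0 : \beta_1 = 0$ is true, $SSR/\sigma^2$ is $\chi^2(1)$, $SSE/\sigma^2$ is $\chi^2(n-2)$ and $SST/\sigma^2$ is $\chi^2(n-1)$.


Proof. As shown in the previous theorem, $SSR = \hat\beta_1^2 S_{xx}$, so that
$$\frac{SSR}{\sigma^2} = \frac{\hat\beta_1^2 S_{xx}}{\sigma^2} = \left( \frac{\hat\beta_1 \sqrt{S_{xx}}}{\sigma} \right)^2. \qquad (3.157)$$
From Theorem 3.1 $\hat\beta_1$ is $N(\beta_1, \sigma^2 / S_{xx})$ so that $\hat\beta_1 \sqrt{S_{xx}} / \sigma$ is $N(\beta_1 \sqrt{S_{xx}} / \sigma, 1)$.

Now if $\beta_1 = 0$, then $\hat\beta_1 \sqrt{S_{xx}} / \sigma$ is $N(0, 1)$ and $SSR/\sigma^2$ is the square of a $N(0, 1)$
random variable and so is $\chi^2(1)$. As we pointed out in Section 3.3, $SSE/\sigma^2$ is $\chi^2(n-2)$
regardless of whether $\beta_1 = 0$. (A proof will be given in Chapter 5.) Also, $SSE/\sigma^2$ is
independent of $\hat\beta_1$ and so of $SSR = \hat\beta_1^2 S_{xx}$. (See Section 3.6.) Thus, $SSR/\sigma^2$ and $SSE/\sigma^2$ are
independent random variables. Since
$$\frac{SST}{\sigma^2} = \frac{SSR}{\sigma^2} + \frac{SSE}{\sigma^2}, \qquad (3.158)$$
$SST/\sigma^2$ is the sum of two independent $\chi^2$ random variables and so is $\chi^2$ with (n - 2) + 1 =
n - 1 degrees of freedom. ∎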
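
A small simulation can make Theorem 3.7 plausible. The sketch below is not a proof and
checks only one consequence of the theorem: a $\chi^2(k)$ variable has mean k, so under
$\beta_1 = 0$ the averages of $SSR/\sigma^2$, $SSE/\sigma^2$ and $SST/\sigma^2$ over many simulated data
sets should be close to 1, n - 2 and n - 1. The design points, parameter values and the
number of replications are assumptions chosen for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

# Hypothetical design and parameters (assumptions for this sketch).
x = [float(i) for i in range(1, 11)]
n = len(x)
xbar = sum(x) / n
s_xx = sum((xi - xbar) ** 2 for xi in x)
beta0, beta1, sigma = 2.0, 0.0, 1.5   # beta1 = 0, i.e. H0 is true
reps = 20000


def simulate_once():
    """Generate one data set and return SSR, SSE, SST divided by sigma^2."""
    y = [beta0 + beta1 * xi + random.gauss(0.0, sigma) for xi in x]
    ybar = sum(y) / n
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / s_xx
    ssr = b1 ** 2 * s_xx                      # (3.151)
    sst = sum((yi - ybar) ** 2 for yi in y)
    sse = sst - ssr                           # (3.142)
    return ssr / sigma ** 2, sse / sigma ** 2, sst / sigma ** 2


totals = [0.0, 0.0, 0.0]
for _ in range(reps):
    for k, value in enumerate(simulate_once()):
        totals[k] += value

means = [t / reps for t in totals]
print("average SSR/sigma^2, SSE/sigma^2, SST/sigma^2:", means)
print("theoretical means:", 1, n - 2, n - 1)
```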


Example 3.14 Calculate the ANOVA table for the data in Example 3.2 and use it 
to determine the percentage of variability in the data explained by the regression line. 


Also test if $\beta_1 = 0$ under the assumption that the errors are independent $N(0, \sigma^2)$.
Using the calculations made in Examples 3.2 and 3.6, we find that
$$SST = \sum_{i=1}^{n} (y_i - \bar{y})^2 = 6,$$
$$SSR = \hat\beta_1^2 S_{xx} = (0.7)^2 \times 10 = 4.9$$
and
$$SSE = SST - SSR = 6 - 4.9 = 1.1.$$
Thus,
$$R^2 = \frac{SSR}{SST} = \frac{4.9}{6} = 0.8167$$
and
$$F = \frac{SSR}{s^2} = \frac{4.9}{0.3667} = 13.37.$$
Arranging these data in the ANOVA table gives:


Table 3.9 ANOVA Table for Drug Data


Source       df   Sum of Squares   Mean Squares   F
Regression   1    4.9              4.9            13.37
Residual     3    1.1              0.3667
Total        4    6.0

R^2 = 0.8167


From the ANOVA table we see that since $R^2 = 0.8167$, the regression line explains
81.67% of the variability in the data.

To test for the significance of the regression we use $\alpha = 0.05$. If the computed F value
13.37 exceeds the tabulated $f_{0.05,1,3}$ value then we can reject $H_0$ at this level; otherwise
$H_0$ is not rejected. From Table A.4, $f_{0.05,1,3} = 10.13$, so that $H_0$ is rejected and the regression
appears to be significant at this level.

Note that, as the general theory shows, $F = T^2$, and so using F or T gives equivalent
tests in this instance.
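
The entries of an ANOVA table such as Table 3.9 are simple to compute. The following is a
minimal sketch: the function name `anova_table` is hypothetical, and the five data points
are an assumption chosen because they reproduce the summary values quoted in the example
(slope 0.7, $S_{xx} = 10$, SST = 6); whether they are the actual observations of Example 3.2
is not confirmed here. The critical value 10.13 is the tabulated value quoted in the text.

```python
def anova_table(x, y):
    """Return the ANOVA quantities for a simple linear regression of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    s_xy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = s_xy / s_xx
    sst = sum((yi - ybar) ** 2 for yi in y)
    ssr = b1 ** 2 * s_xx          # (3.151)
    sse = sst - ssr               # (3.142)
    mse = sse / (n - 2)           # s^2
    return {
        "df": (1, n - 2, n - 1),
        "SSR": ssr,
        "SSE": sse,
        "SST": sst,
        "MSR": ssr / 1,
        "MSE": mse,
        "F": ssr / mse,
        "R2": ssr / sst,
    }


# Assumed data (see the note above).
table = anova_table([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 2.0, 2.0, 4.0])
f_crit = 10.13                    # f_{0.05,1,3} as quoted from Table A.4
reject_h0 = table["F"] > f_crit

for key, value in table.items():
    print(key, value)
print("reject H0 at the 5% level:", reject_h0)
```

The F ratio computed this way may differ from the tabulated 13.37 in the last digit,
depending on how $s^2$ is rounded before dividing.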


Example 3.15 Compute the ANOVA table for the drink delivery data in Example
3.4 and use it to draw conclusions about the validity of this model.

Again, due to the tediousness of the calculations, we resort to computer output for 
the necessary calculations. This yields Table 3.10. 


Table 3.10 ANOVA Table for Delivery Data


Source       df   Sum of Squares   Mean Squares   F
Regression   1    5,382,409        5,382,409      307.8
Residual     23   402,134          17,484
Total        24   5,784,543

R^2 = 0.9305


We note first that the $R^2$ value shows that the model explains over 93% of the
variability in the data. This indicates that the fit is quite good, and the large observed
F value 307.8 is significant at the 0.0001 level. This, coupled with the small value of
s = 4.18 compared to the mean value $\bar{y} = 22.38$, suggests that this model appears to
explain virtually all of the variability in the data and could in its present form probably
make a useful predictor. We will return to this point later.


Example 3.16 Construct the ANOVA table for the tractor maintenance data in 
Example 3.5. 

Here, because hand calculations are inconvenient, we use the data in the computer 
output given for that example. 


Table 3.11 ANOVA Table for Tractor Data


Source       df   Sum of Squares   Mean Squares   F
Regression   1    1,099,635        1,099,635      13.68
Residual     15   1,205,407        80,360
Total        16   2,305,042

R^2 = 0.477
Note that even though $R^2$ is relatively small, $F = 13.68 > f_{0.05,1,15} = 4.54$, so
that the regression is significant at the 5% level. As is indicated in Exercise 3.8, when
there are repeat values of the observations at a given x value, $R^2$ will generally
be smaller than for a similar data set with no repeat values. Again this suggests that
the analyst should be cautious in using only $R^2$ as a measure of fit. Low $R^2$ values do
not necessarily mean that the regression is not significant. There may be just a large
amount of unexplainable random variation in the data.


Example 3.17 Even though large values of $R^2$ indicate an approximate linear relation
between x and y, one must be cautious in interpreting a high degree of correlation as
implying causation. An interesting example concerns data collected by Kendall and Yule
[70]. From 1924 to 1937 the number of certified mental defectives/100,000 population in
England was recorded along with the number of radio licenses. The data are shown in
Table 3.12.


Table 3.12 Radio Licenses and Mental Defects Data

Year   Radios in millions (x)   Mental defectives (y)
1924   1.350                    8
1925   1.960                    8
1926   2.270                    9
1927   2.483                    10
1928   2.730                    11
1929   3.091                    11
1930   3.647                    12
1931   4.620                    16
1932   5.497                    18
1933   6.260                    19
1934   7.012                    20
1935   7.618                    21
1936   8.131                    22
1937   8.593                    23


The scatter plot in Figure 3.10 shows a clear linear trend in the number of mental
defectives as the number of radios increases.
Based on this, the data were fit by least squares with the following results:


$\hat\beta_0 = 4.5822$, $t_0 = 10.82$, p < 0.000,
$\hat\beta_1 = 2.20417$, $t_1 = 27.31$, p < 0.000


and the ANOVA table is given in Table 3.13.

Table 3.13 ANOVA Table for Radio Licenses Data


Source       df   Sum of Squares   Mean Squares   F
Regression   1    393.39           393.39         745.96
Residual     12   6.33             0.53
Total        13   399.72

R^2 = 0.984


These results show a highly significant linear relation between x and y, with the least
squares line explaining 98.4% of the variation in the data. On the basis of these figures
one might conclude that radios were making people crazy. Although that is certainly
possible, a more benign explanation can be given. That is, if both x and y were each
increasing linearly with time, then clearly y would increase linearly with x. This latter
explanation seems more plausible, since radios became more available with time and
better diagnostic procedures enabled one to do a better job of identifying people with
mental problems.

Figure 3.10: Scatter plot for radio and mental defectives data
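
The "common trend in time" explanation can be examined directly with the data of Table
3.12. The sketch below, with hypothetical helper names, fits y on x and then computes the
squared correlation of each variable with the year. It is an illustration of the argument,
not part of the original analysis.

```python
years = list(range(1924, 1938))
radios = [1.350, 1.960, 2.270, 2.483, 2.730, 3.091, 3.647,
          4.620, 5.497, 6.260, 7.012, 7.618, 8.131, 8.593]
defectives = [8, 8, 9, 10, 11, 11, 12, 16, 18, 19, 20, 21, 22, 23]


def centered_sums(u, v):
    """Return S_uu, S_vv and S_uv for two equal-length sequences."""
    n = len(u)
    ubar = sum(u) / n
    vbar = sum(v) / n
    s_uu = sum((a - ubar) ** 2 for a in u)
    s_vv = sum((b - vbar) ** 2 for b in v)
    s_uv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    return s_uu, s_vv, s_uv


def r_squared(u, v):
    """Squared sample correlation, which equals R^2 for a straight-line fit."""
    s_uu, s_vv, s_uv = centered_sums(u, v)
    return s_uv ** 2 / (s_uu * s_vv)


s_xx, s_yy, s_xy = centered_sums(radios, defectives)
slope = s_xy / s_xx
intercept = sum(defectives) / len(defectives) - slope * sum(radios) / len(radios)

r2_xy = r_squared(radios, defectives)       # radios versus defectives
r2_x_time = r_squared(years, radios)        # radios versus year
r2_y_time = r_squared(years, defectives)    # defectives versus year

print("slope, intercept:", slope, intercept)
print("R^2 (y on x):", r2_xy)
print("R^2 (x on year), R^2 (y on year):", r2_x_time, r2_y_time)
```

Both variables are themselves close to linear in the year, which is consistent with the
more benign explanation given above.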


Since the value of $R^2$ is so frequently used as a measure of goodness of fit, it is important
to caution the reader that sometimes large values of $R^2$ may occur merely because the
range of the data is large, producing large values of $S_{xx}$. To see this, we note that a
result of Hahn [48] shows that
$$E(R^2) \approx \frac{\beta_1^2 S_{xx}}{\beta_1^2 S_{xx} + \sigma^2} \qquad (3.159)$$
and this is easily shown to be an increasing function of $S_{xx}$. Thus, a large spread in the
independent variable $(x_1, x_2, \ldots, x_n)$ may produce a large value of $R^2$ having little to do with
the quality of fit. Also note that $E(R^2)$ is an increasing function of $\beta_1^2$. Thus, models
with "large" slopes will generally produce larger values of $R^2$ than those with smaller
ones. Again, we observe that a large value of $R^2$ may occur which is not indicative of
the quality of the fit.
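
The dependence of the expected $R^2$ on the spread of the x values can be tabulated. The
sketch below assumes that the approximation (3.159) has the form
$\beta_1^2 S_{xx} / (\beta_1^2 S_{xx} + \sigma^2)$; the function name and the parameter values are
hypothetical and chosen only for illustration.

```python
def expected_r2(beta1, s_xx, sigma2):
    """Approximate E(R^2), assuming the form beta1^2 S_xx / (beta1^2 S_xx + sigma^2)."""
    signal = beta1 ** 2 * s_xx
    return signal / (signal + sigma2)


# Same slope and error variance, increasingly spread-out x values.
beta1, sigma2 = 1.0, 4.0
spreads = [1.0, 10.0, 100.0]
values = [expected_r2(beta1, s, sigma2) for s in spreads]

for s, v in zip(spreads, values):
    print("S_xx =", s, " approximate E(R^2) =", round(v, 4))
```

With the slope and error variance held fixed, the approximate expected $R^2$ rises from 0.2
to about 0.96 as $S_{xx}$ goes from 1 to 100, although the underlying relation has not changed.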
In general, we need to use multiple criteria in assessing the quality of the fit. Among
these criteria are:
(i) 'large' $R^2$,
(ii) 'large' F or |t| values,
(iii) 'small' values of $s^2$ relative to $\bar{y}$, the mean of the observations.


As we shall see, other criteria need to be examined as well, and these will be discussed 
as we proceed. 
3.8.1 Regression Through the Origin


So far we have considered the simple linear regression model in the form of (3.1) where
initially it is assumed that $\beta_0 \neq 0$. However, there are some circumstances where the
appropriate model requires that $\beta_0 = 0$. In this case we have a situation which differs
slightly from that when $\beta_0 \neq 0$, and we will discuss the similarities and differences in
estimation and inference in this section. As noted previously, the regression problem
with $\beta_0 = 0$ is usually called regression through the origin.

First, where would the model
$$Y_i = \beta_1 x_i + \varepsilon_i, \quad 1 \le i \le n, \qquad (3.160)$$
be more appropriate than (3.1)? There are two such circumstances.


(i) If it is known a priori from physical considerations that $E(Y_0) = \beta_0 = 0$, then there
is no point in using a degree of freedom to estimate $\beta_0$, since this will generally
decrease the accuracy in estimating $\sigma^2$ and so the accuracy in estimating $\beta_1$ will
generally be decreased as well.


(This is, as we shall see in Chapter 8, a general property of including extraneous
variables in the regression model.)


As a hypothetical example, suppose we wished to model the weight y
of a given population as a function of height x. If we knew that the model was
linear near x = 0, then certainly a person of zero height would have zero weight and
choosing $\beta_0 = 0$ is appropriate.


(ii) If we believe initially that $\beta_0 \neq 0$ and a t-test suggests on the basis of the observed
data that $\beta_0 = 0$, then $\beta_0$ might be eliminated from the model. Since in many
practical cases one cannot be sure that the model is valid near the origin, some
statisticians insist that the intercept always be kept in even if it does not appear to
differ significantly from zero.


One should be cautioned, particularly in using (3.1), that choosing the intercept $\beta_0 = 0$
initially may not be correct even if physically $E(Y_0) = 0$. Unless we know for certain
that the model is linear near x = 0, setting $\beta_0 = 0$ may lead to badly biased estimates
of $\beta_1$ if the independent variable is measured far away from x = 0. A possible example
of what could happen is shown in Figure 3.11.


3.8.2 Estimation and Testing for Regression through the Origin 


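When $\beta_0 = 0$, the least-squares estimate $\hat\beta_1$ of $\beta_1$ is given by minimizing
$$S = \sum_{i=1}^{n} (y_i - \beta_1 x_i)^2 \qquad (3.161)$$
and is easily found to be
$$\hat\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}. \qquad (3.162)$$

Figure 3.11: True relationship between weight and gestation period in Ex. 3.6

If the errors $\varepsilon_i$, $1 \le i \le n$, are independent $N(0, \sigma^2)$ random variables, then $\hat\beta_1$ is unbiased
and
$$Var(\hat\beta_1) = \frac{\sigma^2}{\sum_{i=1}^{n} x_i^2}. \qquad (3.163)$$
In this case $\hat\beta_1 \sim N(\beta_1, \sigma^2 / \sum_i x_i^2)$ and
$$s^2 = \frac{SSE}{n-1} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n-1} \qquad (3.164)$$
is an unbiased estimate of $\sigma^2$.
Additionally, $SSE/\sigma^2$ is $\chi^2(n-1)$ and independent of $\hat\beta_1$. The significance of the
regression can be tested using the T statistic
$$T = \frac{\hat\beta_1}{\hat\sigma(\hat\beta_1)} = \frac{\hat\beta_1 \sqrt{\sum_{i=1}^{n} x_i^2}}{s} \qquad (3.165)$$
which has a t-distribution with n - 1 degrees of freedom. A $(1 - \alpha) \times 100\%$ confidence
interval for $\beta_1$ is given by
$$\hat\beta_1 \pm t_{n-1,\alpha/2} \, \hat\sigma(\hat\beta_1). \qquad (3.166)$$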


From this discussion one can see that the basic theory for the case $\beta_0 = 0$ is quite
similar to that for $\beta_0 \neq 0$. However, the one place where there is some difference is in
the ANOVA table and in the development of a goodness of fit measure. The problem
here is that the basic decomposition of the sum of squares SST = SSR + SSE fails to
hold in this case. The reason for this is that the sum of the residuals $\sum_{i=1}^{n} (y_i - \hat{y}_i)$ is
not always zero in this case, so that $\bar{\hat{y}} \neq \bar{y}$, which is required to prove (3.143). Thus, an
$R^2$ value cannot be defined via (3.138), so we need to modify the ANOVA table and the
ANOVA approach to testing.
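
The estimation formulas for the zero-intercept model, and the fact that its residuals need
not sum to zero, can be sketched as follows. The data are hypothetical (an assumption for
illustration), the formulas are the standard least squares results for minimizing (3.161),
and the value 2.776 is the standard tabulated $t_{4,0.025}$.

```python
import math

# Hypothetical data for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]
n = len(x)

sum_x2 = sum(xi * xi for xi in x)
b1 = sum(xi * yi for xi, yi in zip(x, y)) / sum_x2   # slope through the origin

fitted = [b1 * xi for xi in x]
resid = [yi - fi for yi, fi in zip(y, fitted)]

sse = sum(e * e for e in resid)
s2 = sse / (n - 1)                    # n - 1 df: only one parameter is estimated
se_b1 = math.sqrt(s2 / sum_x2)        # estimated standard error of the slope
t_stat = b1 / se_b1

t_crit = 2.776                        # t_{4, 0.025} from standard tables
ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)

# The single normal equation forces sum(x_i * e_i) = 0, but not sum(e_i) = 0.
sum_x_resid = sum(xi * e for xi, e in zip(x, resid))
sum_resid = sum(resid)

print("slope, standard error, t:", b1, se_b1, t_stat)
print("95% confidence interval for the slope:", ci)
print("sum x_i e_i:", sum_x_resid, " sum e_i:", sum_resid)
```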


3.8. THE ANOVA APPROACH TO TESTING 




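We begin with a new decomposition of a sum of squares, which is valid both in the
case $\beta_0 = 0$ and $\beta_0 \neq 0$. We give a proof for $\beta_0 = 0$.


Theorem 3.8 In the linear regression model (3.160) with $\beta_0 = 0$
$$\sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} \hat{y}_i^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2. \qquad (3.167)$$

Proof. Writing $y_i = (y_i - \hat{y}_i) + \hat{y}_i$ and squaring gives
$$\sum_{i=1}^{n} y_i^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} \hat{y}_i^2 + 2 \sum_{i=1}^{n} (y_i - \hat{y}_i) \hat{y}_i. \qquad (3.168)$$
Now from the least squares equation $\partial S / \partial \beta_1 = 0$, we get
$$\sum_{i=1}^{n} (y_i - \hat\beta_1 x_i) x_i = 0, \qquad (3.169)$$
and multiplying (3.169) by $\hat\beta_1$ gives
$$\sum_{i=1}^{n} (y_i - \hat{y}_i) \hat{y}_i = \sum_{i=1}^{n} \hat{y}_i \hat{e}_i = 0. \qquad (3.170)$$
Thus the last term in (3.168) is zero and (3.167) holds. ∎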


If we now use $\sum_{i=1}^{n} y_i^2$ as a measure of the variability of the data, then a reasonable
analogue of $R^2$ in this case seems to be
$$R^2 = \frac{\sum_{i=1}^{n} \hat{y}_i^2}{\sum_{i=1}^{n} y_i^2} \qquad (3.171)$$
and

(3.172)

(3.173)

(3.174)
so that with this definition the relation between $R^2$, F and $T^2$ is the same as when
$\beta_0 \neq 0$.

However, this definition of $R^2$ is generally not used in practice, since it does not allow
for a direct comparison of the goodness of fit of models with and without an intercept.
To see this, suppose we wish to compare the fit of a model with $\beta_0 \neq 0$ with one where
$\beta_0 = 0$. It appears reasonable to choose as the "better" model the one with the larger
$R^2$.

However, when $\beta_0 = 0$
$$R^2 = 1 - \frac{SSE}{\sum_{i=1}^{n} y_i^2} \qquad (3.175)$$
and since one generally has $\sum_{i=1}^{n} (y_i - \bar{y})^2 < \sum_{i=1}^{n} y_i^2$, the $R^2$ value in the zero intercept
model can be considerably larger than the $R^2$ value when $\beta_0 \neq 0$ even if the SSEs are
comparable in both cases.

Because of this problem, the definition of an appropriate $R^2$ value when $\beta_0 = 0$ has
generated some controversy and a number of alternative definitions have been given in
[47]. Perhaps a satisfactory choice is to define $R^2$ as the square of the correlation coefficient
between $y = (y_1, y_2, \ldots, y_n)$ and $\hat{y} = (\hat{y}_1, \hat{y}_2, \ldots, \hat{y}_n)$, because this property holds when
$\beta_0 \neq 0$. We will use this definition in all further discussions of the zero intercept model.

Another way of comparing the fit of the zero intercept model with that when $\beta_0 \neq 0$
is to use $s^2$: the model with the smaller value is preferred as the one with the better
fit [47].

Summarizing, the ANOVA table for the zero intercept model is given in the format
shown below.



Table 3.14 ANOVA Table for the Model with $\beta_0 = 0$


Source       df      Sum of Squares                  Mean Squares              F
Regression   1       SSR = \sum \hat{y}_i^2          MSR = SSR/1               MSR/MSE
Residual     n - 1   SSE = \sum (y_i - \hat{y}_i)^2  MSE = SSE/(n - 1) = s^2
Total        n       SST = \sum y_i^2

R^2 = \hat\rho^2(y, \hat{y})
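
The two candidate definitions of $R^2$ for the zero-intercept model can give very different
values. The sketch below uses hypothetical data whose mean is far from zero (an assumption
made to exaggerate the effect) and compares the uncentered ratio $1 - SSE / \sum y_i^2$ with the
squared correlation between y and $\hat{y}$. The function names are illustrative.

```python
# Hypothetical data with a mean well away from zero.
x = [10.0, 11.0, 12.0, 13.0, 14.0]
y = [20.5, 21.0, 25.0, 24.5, 29.0]

b1 = sum(xi * yi for xi, yi in zip(x, y)) / sum(xi * xi for xi in x)
fitted = [b1 * xi for xi in x]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))

sum_y2 = sum(yi * yi for yi in y)
sum_fit2 = sum(fi * fi for fi in fitted)


def squared_correlation(u, v):
    """Square of the sample correlation coefficient between u and v."""
    n = len(u)
    ubar = sum(u) / n
    vbar = sum(v) / n
    s_uv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    s_uu = sum((a - ubar) ** 2 for a in u)
    s_vv = sum((b - vbar) ** 2 for b in v)
    return s_uv ** 2 / (s_uu * s_vv)


r2_uncentered = 1.0 - sse / sum_y2             # based on sum of y_i^2
r2_corr = squared_correlation(y, fitted)       # squared correlation of y and fitted

print("uncentered R^2:", r2_uncentered)
print("squared correlation of y and fitted values:", r2_corr)
print("decomposition check:", sum_y2, sum_fit2 + sse)
```

For these data the uncentered value is close to 1 while the squared correlation is
noticeably smaller, which illustrates why the two definitions should not be compared
with each other.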


Example 3.18 In Example 3.6 (Birth weight data) we considered the relation between
birth weight and gestation period for newborn babies. The fitted model was
$$\hat{y} = -1393.0 + 113.20x.$$
Further analysis gives the ANOVA table: 


Table 3.15 ANOVA Table for Birth weight Data


Source       df   Sum of Squares   Mean Squares   F
Regression   1    973,295          973,295        25.05
Residual     22   854,707          38,850
Total        23   1,828,003

R^2 = 0.532




From the F value we see that the regression is significant at a level below 0.01%. However,
the t value for $\hat\beta_0$ is -1.60, which is not significant at the 5% level (p = 0.125).

In addition, $\hat\beta_0$ has a large negative value which gives an unphysical value if we
extrapolate to x = 0. Although x = 0 is far out of the range of the observed data, it is
reasonable to consider refitting the data with a line through the origin. Doing this gives
$\hat\beta_1 = 77.130$ and $t_1 = 71.59$, which is significant at less than the 0.01% level. Moreover,
$\hat\sigma(\hat\beta_1) = 1.077$ for this model compared to 22.62 for the intercept model. Hence, the
slope is estimated much more precisely than when the intercept is included.

For the intercept model $R^2 = 0.532$ and s = 197.1, while for the zero-intercept model
s = 203.6 and $\hat\rho^2(y, \hat{y}) = 0.533$. Thus both models appear to fit the data equally well,
but the greater precision of the slope for the zero-intercept model suggests it might be a
better choice for prediction.


3.8.3 Prediction 


Once one has fitted the regression model to the data it is often used to estimate values of
y at values of x which were not in the original data set. In fact, making such predictions
is probably the most likely reason one attempts to fit a model in the first place.

There are two types of prediction we consider. First, we can predict a value of the
mean response $\mu_{x_0} = E(Y_{x_0})$ at a point $x_0$. Second, we can predict the value of a new
observation $Y_{x_0}$ taken at $x_0$. For both of these we use the point estimate
$$\hat{Y}_{x_0} = \hat\beta_0 + \hat\beta_1 x_0. \qquad (3.176)$$

However, interval estimates are different in each case. Since $\mu_{x_0} = \beta_0 + \beta_1 x_0$ is a
parameter, we can obtain confidence intervals for it, at least under our standard normality
assumptions. To do this we begin by obtaining a formula for $Var(\hat{Y}_{x_0})$.

Using the formulas for $(\hat\beta_0, \hat\beta_1)$, $\hat{Y}_{x_0} = \bar{Y} + \hat\beta_1 (x_0 - \bar{x})$ and the fact that $Cov(\bar{Y}, \hat\beta_1) = 0$,
as shown in Theorem 3.1, gives
$$Var(\hat{Y}_{x_0}) = Var(\bar{Y}) + (x_0 - \bar{x})^2 Var(\hat\beta_1) = \frac{\sigma^2}{n} + \frac{(x_0 - \bar{x})^2 \sigma^2}{S_{xx}} = \sigma^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]. \qquad (3.177)$$

To estimate $Var(\hat{Y}_{x_0})$ we replace $\sigma^2$ in (3.177) by $s^2$, giving
$$\hat\sigma^2(\hat{Y}_{x_0}) = s^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]. \qquad (3.178)$$

The square root of $\hat\sigma^2(\hat{Y}_{x_0})$, $\hat\sigma(\hat{Y}_{x_0})$, is usually called the standard error of prediction at
$x = x_0$.
Now if the errors are independent $N(0, \sigma^2)$, then
$$\hat{Y}_{x_0} \sim N\left( \mu_{x_0}, \; \sigma^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right] \right), \qquad (3.179)$$
so that
$$\frac{\hat{Y}_{x_0} - \mu_{x_0}}{\sigma(\hat{Y}_{x_0})} \sim N(0, 1). \qquad (3.180)$$
Since $\hat{Y}_{x_0}$ and s are independent random variables in this case, the random variable
$$T = \frac{\hat{Y}_{x_0} - \mu_{x_0}}{\hat\sigma(\hat{Y}_{x_0})} \qquad (3.181)$$
has a t distribution with n - 2 degrees of freedom. Thus, using standard manipulations,
a $(1 - \alpha) \times 100\%$ confidence interval for $\mu_{x_0}$ is given by
$$\hat{Y}_{x_0} \pm t_{n-2,\alpha/2} \, \hat\sigma(\hat{Y}_{x_0}). \qquad (3.182)$$

One can see from (3.179) that we get the shortest confidence intervals for $x_0 = \bar{x}$,
and as $|x_0 - \bar{x}|$ increases, these confidence intervals increase in width. In particular, the
farther away we are from the region where the original data are taken, the less reliable
are our predictions. Since the true model may not even be valid outside the region where
data have been taken, one must be quite cautious in using the model to predict Y outside
of the interval $[\min x_i, \max x_i]$.

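Interval estimates for $Y_{x_0}$ are not confidence intervals since $Y_{x_0}$ is not a parameter.
These estimates are referred to as prediction intervals. To obtain these, we first consider
the variance of $Y_{x_0} - \hat{Y}_{x_0}$.
If the new observation $Y_{x_0}$ is independent of $Y_i$, $1 \le i \le n$, then
$$Var(Y_{x_0} - \hat{Y}_{x_0}) = Var(Y_{x_0}) + Var(\hat{Y}_{x_0}) = \sigma^2 + \sigma^2 \left[ \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right] = \sigma^2 \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}} \right]. \qquad (3.183)$$

To estimate $Var(Y_{x_0} - \hat{Y}_{x_0})$ we replace $\sigma^2$ with $s^2$, and hence $\sigma(Y_{x_0} - \hat{Y}_{x_0})$ is estimated
by
$$s_p = s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}. \qquad (3.184)$$
Thus, under our standard normality assumptions
$$T = \frac{Y_{x_0} - \hat{Y}_{x_0}}{s \sqrt{1 + \frac{1}{n} + \frac{(x_0 - \bar{x})^2}{S_{xx}}}} \qquad (3.185)$$
will have a t distribution with n - 2 degrees of freedom. Hence
$$P\{ -t_{n-2,\alpha/2} \le T \le t_{n-2,\alpha/2} \} = 1 - \alpha \qquad (3.186)$$
and from this we find that
$$\hat{Y}_{x_0} - t_{n-2,\alpha/2} \, s_p \le Y_{x_0} \le \hat{Y}_{x_0} + t_{n-2,\alpha/2} \, s_p \qquad (3.187)$$
with probability $1 - \alpha$. The interval
$$\hat{Y}_{x_0} \pm t_{n-2,\alpha/2} \, s_p \qquad (3.188)$$
is called a $(1 - \alpha) \times 100\%$ prediction interval for $Y_{x_0}$. The same comments regarding the
reliability of prediction of a new observation hold as before.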

One can see from (3.183) and (3.185) that the reliability of prediction increases as
the sample size n increases and as the range of the x's, measured by $S_{xx}$, increases, while
the reliability decreases as $|x_0 - \bar{x}|$ increases. If one can choose the $x_i$'s a priori, one can
increase the reliability of prediction by choosing the independent variable well spread
out. On the other hand, as we noted previously, this tends to inflate $R^2$, leading perhaps
to a poorer fit. This is a fundamental contradiction in regression analysis. A good fit
may not result in good prediction, while good predictions can be made from less reliable
fits.

Finally, we note that these results hold strictly only if $\varepsilon_i$, $1 \le i \le n$, are independent
$N(0, \sigma^2)$. However, because under very general conditions on the errors $(\hat\beta_0, \hat\beta_1)$ are
asymptotically normal, the confidence intervals for $E(Y_{x_0})$ will be valid for large n. But
the prediction intervals depend on the normality of the errors even for large values of n,
so may not be valid when normality is violated.


Example 3.19 In the drug response example, Example 3.2, find a 95% confidence
interval for the mean reaction time if the percentage of drug in the blood stream is 6%.
Here $x_0 = 6$, so that the estimated prediction variance is
$$\hat\sigma^2(\hat{Y}_6) = s^2 \left[ \frac{1}{5} + \frac{(6 - 3)^2}{10} \right] = \frac{1.1}{3} \left[ \frac{1}{5} + \frac{9}{10} \right] = 0.4033$$
and the standard error of prediction at x = 6 is
$$\hat\sigma(\hat{Y}_6) = \sqrt{0.4033} = 0.6351.$$
Hence, the estimated mean response is
$$\hat{Y}_6 = -0.1 + 0.7 \times 6 = 4.1$$
so that a 95% confidence interval for the mean response $\mu_6$ is given by
$$4.1 \pm t_{3,0.025} \, \hat\sigma(\hat{Y}_6) = 4.1 \pm 3.182 \times 0.6351 = (2.079, 6.121).$$

A 95% prediction interval at x = 6 is given by
$$4.1 \pm 3.182 \sqrt{\hat\sigma^2(\hat{Y}_6) + s^2} = 4.1 \pm 3.182 \times 0.8774 = (1.308, 6.892).$$
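
The interval calculations can be sketched in a few lines. The function name
`mean_and_prediction_intervals` is hypothetical, the t quantile is passed in as a tabulated
value (3.182, as quoted in the example), and the five data points are an assumption: they
reproduce the summary values used above ($\bar{x} = 3$, $S_{xx} = 10$, fitted line -0.1 + 0.7x,
SSE = 1.1) but are not confirmed to be the data of Example 3.2.

```python
import math


def mean_and_prediction_intervals(x, y, x0, t_crit):
    """Return (y0_hat, confidence interval, prediction interval) at x0."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    s_xx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / s_xx
    b0 = ybar - b1 * xbar
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    s2 = sse / (n - 2)

    y0_hat = b0 + b1 * x0                                  # (3.176)
    leverage = 1.0 / n + (x0 - xbar) ** 2 / s_xx
    se_mean = math.sqrt(s2 * leverage)                     # (3.178)
    se_pred = math.sqrt(s2 * (1.0 + leverage))             # (3.184)

    ci = (y0_hat - t_crit * se_mean, y0_hat + t_crit * se_mean)
    pi = (y0_hat - t_crit * se_pred, y0_hat + t_crit * se_pred)
    return y0_hat, ci, pi


# Assumed data (see the note above); 3.182 is the tabulated t_{3, 0.025}.
y0_hat, ci, pi = mean_and_prediction_intervals(
    [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.0, 2.0, 2.0, 4.0], x0=6.0, t_crit=3.182
)
print("estimate:", round(y0_hat, 3))
print("95% confidence interval:", tuple(round(v, 3) for v in ci))
print("95% prediction interval:", tuple(round(v, 3) for v in pi))
```

The prediction interval is always wider than the confidence interval at the same $x_0$,
since it also carries the variance of the new observation.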
Example 3.20 (Clark County population data) Many economic and social activities
require the accurate prediction of population sizes. For example, businesses need to 
be able to estimate the size of a market for a product and governments need good 
population estimates in order to plan for schools, roads and the allocation of funds for 
social programs. In the United States the importance of this was recognized by the 
founding fathers and the constitution requires that a census be conducted every ten 
years. 

These problems are particularly acute in the authors’ home town of Las Vegas, 
Nevada. Las Vegas is located in Clark County, Nevada which for many years has been 
the fastest growing county in the United States. (As a point of interest, the famous 
“Las Vegas Strip” does not belong to the City of Las Vegas, it belongs to Clark County.) 
To plan for the future it is important to be able to make accurate predictions of the 
population. 

The simplest approach seems to be to use past census data to model the growth and 
use this model to predict into the future. In Table 3.16 we show the census data for 
the years 1920-1980 and a scatter plot is shown in Figure 3.12. That plot shows steady 
nonlinear growth over that period. 


Table 3.16 Clark County Population

Obs. No. (i)   Year (x_i)   Population (y_i)
1              1920         4,859
2              1930         8,539
3              1940         16,414
4              1950         48,589
5              1960         127,016
6              1970         273,288
7              1980         463,087

Figure 3.12: Scatter plot for Clark County population data

Although the trend is clearly nonlinear we begin our analysis by assuming a linear
model
$$Y = \beta_0 + \beta_1 x + \varepsilon \qquad (3.189)$$
where Y is the population and x is time coded from 0 to 6.
This line was fit by least squares and the results are given below.


$\hat\beta_0 = -81{,}331$, $t_0 = -1.40$, p = 0.22,
$\hat\beta_1 = 71{,}957$, $t_1 = 4.47$, p = 0.007,
$R^2 = 0.80$, and F = 19.96.


From these results we see that the slope $\beta_1$ is significantly different from zero and the
model explains 80% of the variability in the data. However, there are some obvious
problems with the estimated populations, which are shown in Table 3.17.


Table 3.17 Data, Fitted Values for Example 3.20


Year    Population   Fitted value    Residual
(x_i)   (y_i)        (\hat{y}_i)     (\hat{e}_i = y_i - \hat{y}_i)
1920    4,859        -81,331         86,190
1930    8,539        -9,397          17,936
1940    16,414       62,584          -46,170
1950    48,589       134,541         -85,952
1960    127,016      206,498         -79,482
1970    273,288      278,455         -5,167
1980    463,087      350,412         112,675


Notice that the estimated populations in 1920 and 1930 are negative, obviously a
nonphysical result, but the model appears to give more reasonable results at later times.

As a consequence, we consider using the model to predict the population in 1990
(x = 7). Using (3.176) we get
$$\hat{y}(1990) = -81{,}331 + 71{,}957 \, (7) = 422{,}368$$
and 95% confidence and prediction intervals are given by


(a) 95% confidence interval: (237,184, 607,554),
(b) 95% prediction interval: (135,482, 709,256).


Even though the fit is significant, it does not predict very well, since the 1990 census
gave the population of Clark County as 768,203, which lies outside the 95% prediction
interval.

To improve the predictive ability of the model we consider finding one that fits the
population data better. From the scatter plot it appears that the population grows
exponentially, so we consider a model of the form
$$y = \gamma_0 \exp(\beta_1 x) \qquad (3.190)$$
where $\beta_1$ is the growth rate. Since (3.190) is not linear in $(\gamma_0, \beta_1)$, the theory developed
so far is not immediately applicable. However, taking logarithms in (3.190) gives
$$\log y = \log \gamma_0 + \beta_1 x. \qquad (3.191)$$
If we let $y' = \log y$ and $\beta_0 = \log \gamma_0$, then
$$y' = \beta_0 + \beta_1 x. \qquad (3.192)$$

So (3.192) represents a linear relation between $(x, y')$. The scatter plot of $(x, y')$ is
shown in Figure 3.13 and the points fall almost on a straight line. To test this, we fitted
a least squares line to $(x, y')$ and the results are given below.


$\hat\beta_0 = 8.3379$, $t_0 = 67.07$, $p < 10^{-3}$,
$\hat\beta_1 = 0.80896$, $t_1 = 23.46$, $p < 10^{-3}$.


Table 3.18 ANOVA Table for Clark County Population Data


Source       df   Sum of Squares   Mean Squares   F
Regression   1    18.324           18.324         550.46
Residual     5    0.166            0.033
Total        6    18.490

R^2 = 0.991


From this we see that the linearized model gives an almost perfect fit to the data
and is significant at the < 0.01% level. One would expect that such a good fitting model
would be a useful predictor. To check this we again use the model to predict the 1990
population. Taking x = 7 in (3.192), the estimated value of $y'$ and the confidence and
prediction intervals are:


$\hat{y}'(1990) = 14.004$, $\hat\sigma(\hat{y}'(1990)) = 0.1545$


(a) 95% confidence interval: (13.6041, 14.3971),
(b) 95% prediction interval: (13.3863, 14.6148).


To compare these predictions with the linear model we exponentiate $\hat{y}'$ and the end
points of the intervals to get
$$\hat{y}(1990) = 1{,}209{,}842$$

(a) 95% confidence interval: (809,360, 1,788,880),
(b) 95% prediction interval: (651,023, 2,223,960).

Now note that, compared to the true population 768,203, the predicted population is
in greater error than that given by the worse fitting linear model; however, the true value
does lie in the 95% prediction interval. It appears, then, that the better fitting model is not
necessarily the better predictor. Because we are predicting ten years beyond the range of the
data we have little reason to believe that either model, (3.189) or (3.190), is valid there.
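
The two point predictions for 1990 can be reproduced from the census figures in Table
3.16. The sketch below is a minimal illustration with a hypothetical helper `fit_line`. It
assumes time is coded 0 to 6 (so 1990 corresponds to x = 7) and that the exponential
model is linearized by regressing log y on the coded time x; interval estimates are
omitted because they require a t quantile.

```python
import math

population = [4859, 8539, 16414, 48589, 127016, 273288, 463087]
x = list(range(7))            # 1920, 1930, ..., 1980 coded as 0, 1, ..., 6
census_1990 = 768203          # 1990 census figure quoted in the text


def fit_line(u, v):
    """Least squares intercept and slope for v regressed on u."""
    n = len(u)
    ubar = sum(u) / n
    vbar = sum(v) / n
    s_uu = sum((a - ubar) ** 2 for a in u)
    s_uv = sum((a - ubar) * (b - vbar) for a, b in zip(u, v))
    slope = s_uv / s_uu
    return vbar - slope * ubar, slope


# Linear model for the population itself.
lin_b0, lin_b1 = fit_line(x, population)
lin_pred = lin_b0 + lin_b1 * 7

# Exponential model, fitted as a straight line in log(population).
log_b0, log_b1 = fit_line(x, [math.log(p) for p in population])
log_pred_on_log_scale = log_b0 + log_b1 * 7
log_pred = math.exp(log_pred_on_log_scale)

print("linear model:      slope", round(lin_b1), " prediction", round(lin_pred))
print("exponential model: slope", round(log_b1, 4), " prediction", round(log_pred))
print("absolute errors:", round(abs(lin_pred - census_1990)),
      round(abs(log_pred - census_1990)))
```

Although the log-scale fit is much tighter, its back-transformed point prediction is
further from the census value than the prediction from the straight line, in line with the
discussion above.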

As a last comment, these data illustrate that one should take with a “grain of salt” 
those predictions one is always seeing in the popular media. Prediction errors can be 
very large, but they are rarely published, e.g., the recent budget surplus predictions. 

Figure 3.13: Scatter plot for Clark County population, log Y


3.9 Assessing Model Validity 


So far we have developed the theory of the simple linear regression model under the
assumption that we know that the true model for $E(Y_i)$ is linear. Although this knowledge
may be available a priori, perhaps from physical and/or theoretical considerations,
in most practical situations we will not know whether the simple linear regression model
is true, and an important part of the analysis is to see whether such a model is tenable.
Since the violation of linearity will generally (but not always) invalidate our previous
analysis concerning $(\beta_0, \beta_1)$, it is desirable to have tests that would check for linearity
before any further analysis is done. For the most part we will only have the data
$\{(x_i, y_i)\}_{i=1}^{n}$ to work with, so they will have to form the basis for any analysis.
There are a number of procedures that can be used in this regard:


(i) Examine the scatter plot of $\{(x_i, y_i)\}$. Pronounced curvature in the plot such as that
shown in Figure 3.12 suggests that a quadratic model
$$Y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \varepsilon_i \qquad (3.193)$$
might be a more appropriate choice than a linear model. However, the scatter plot
may be confusing, or even misleading (see Example 5.13), if the departures from
the simple linear model result from missing variables, rather than curvature in
x. For example, Figure 3.14 suggests that there is a slight upward trend as shown
by the least squares line superimposed on the scatter plot. But it is not clear if
the large fluctuations are just random error, or due to some other factors which
have not been accounted for. In this case it can be shown (see Example 3.21) that
the fluctuations can be much better explained in terms of seasonal factors, rather
than random errors. Of course, once one is aware of the source of the data, such a
hypothesis is reasonable.


(ii) Examination of the test statistics associated with a preliminary least squares fit to
the data. For example, a small value of $R^2$ along with an apparently significant t
value for $\beta_1$ generally suggests that the true model contains variables other than
x. On the other hand, a large value of $R^2$ and significant t values do not of themselves
imply that the model is linear.


(iii) Residual plots from the least squares fit are another effective diagnostic tool for
examining the validity of the linearity assumption. Since the residuals estimate
what variability remains in the data after the linear part in x has been removed,
it is reasonable to expect that their values would be useful in detecting departures
from linearity. There are a variety of plots that are commonly recommended for
doing this, and these will be explored later in this chapter and in more detail in
Chapter 6.


Example 3.21 (Jewelry sales data [123]) Suppose we wish to examine how depart- 
ment store sales of jewelry increase over time. Table 3.19 shows quarterly sales from 1957 
to 1960. When we plot quarterly sales against time as in Figure 3.14, we note that sales 
shoot up every fourth quarter because of the Christmas and holiday seasons. 


Table 3.19 Jewelry Sales Data 
Year Quarter Sales (in $100,000) 


1 

2 

3 

4 

1 

2 

3 

4 

1 

2 

3 

4 

1 

2 

3 

4 

After fitting the model $Y_t = \beta_0 + \beta_1 t + \varepsilon_t$ ($t$ = time) we obtained the following results:

$\hat\beta_0 = 45.68$, $t_0 = 2.83$, $p = 0.013$,
$\hat\beta_1 = 1.921$, $t_1 = 1.15$, $p = 0.269$,

$R^2 = 0.087$, and $F = 1.33$.


From these results we see that the hypothesis $H_0: \beta_1 = 0$ is not rejected ($p > 0.05$) and
the fitted line explains only 8.7% of the variability of sales in the data. Also, since the $F$
value is 1.33 ($< f_{1,14,0.05} = 4.60$), the estimated regression model, which is superimposed
in Figure 3.14, seems to be poor. Nevertheless, it should be noted that the estimated
slope $\hat\beta_1$ is biased because of seasonal factors, which leads us to introduce additional
regressor variable(s) in Chapter 7 to take seasonal factors into account.
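
The summary statistics reported above can be cross-checked against one another, since in simple linear regression $F = t_1^2$ and $R^2 = F/(F + n - 2)$. The following is a minimal Python sketch of that check; it assumes $n = 16$ observations (four years of quarterly data), and the variable names are illustrative only.

```python
# Consistency check for the summary statistics reported in Example 3.21.
# Assumption: n = 16 observations (4 years x 4 quarters).
n = 16
t1 = 1.15                  # reported t statistic for the slope

F = t1 ** 2                # in simple linear regression, F equals t squared
r2 = F / (F + (n - 2))     # R^2 = SSR/SST = F / (F + n - 2)

print(f"F computed from t   = {F:.2f}")
print(f"R^2 computed from F = {r2:.3f}")
```

The computed values agree with the reported $F = 1.33$ and $R^2 = 0.087$ up to the rounding of $t_1$.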

Since scatter and residual plots can be quite subjective, it would be helpful to have a
more objective analytical approach for assessing the linearity of the model. Unfortunately
few tools seem to be available for doing this, and for most data, particularly those from
unplanned experiments, the techniques in (i)-(iii), along with judgement, are those most
often used in practice. On the other hand, for designed experiments of the type that
occur in industrial or clinical situations, there is an analytic test, the "lack of
fit" test, that is widely advocated.

Figure 3.14: Scatter plot and fitted line of jewelry sales data

Figure 3.15: Scatter plot of residuals versus fitted values


3.9.1 The Lack of Fit Test (LOFT) 


The basic idea of the LOFT is to obtain an independent estimate of $\sigma^2$, other than that
given by $s^2$. These two estimates can then be compared in a suitable test to determine
whether the linearity assumption is viable.

To be more formal, suppose we wish to decide which one of the two models

$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \quad 1 \le i \le n,$ (3.194)

or

$Y_i = \beta_0 + \beta_1 x_i + \eta_i + \varepsilon_i, \quad 1 \le i \le n,$ (3.195)

best explains our data. Here, $\eta_i$ represents the departure of the model from the simple
linear regression model. As before, we assume in both cases that $(\varepsilon_i)_{i=1}^{n}$ are independent
$N(0, \sigma^2)$. Thus, we wish to test

$H_0: \eta_i = 0, \quad 1 \le i \le n,$ (3.196)

against

$H_1:$ at least one $\eta_i \ne 0, \quad 1 \le i \le n.$ (3.197)


To construct a test of $H_0$, consider the estimate of $\sigma^2$ given by $s^2$ when the data have
been fit by least squares. If $H_0$ is true then, as shown in Theorem 3.2, $E(s^2) = \sigma^2$, while
if $H_1$ is true it can be shown that

$E(s^2) = \sigma^2 + \frac{1}{n-2} \sum_{i=1}^{n} \left[ E(Y_i) - E(\hat Y_i) \right]^2$ (3.198)

where $\hat Y_i$ is the least squares estimator of $Y_i$. The term $E(Y_i) - E(\hat Y_i)$ represents the
bias in estimating $Y_i$ by least squares if (3.195) is the true model and can be shown to
be a function of the $\eta_i$. Of course if $H_0$ is true, then $E(Y_i) = E(\hat Y_i)$ and $E(s^2) = \sigma^2$ as we
know. From (3.198) we see that if $H_1$ is true then $s^2$ will tend to be larger than $\sigma^2$. If $\sigma^2$
were known, then we could reject $H_0$ if the ratio $s^2/\sigma^2$ was sufficiently large. In general,
however, $\sigma^2$ will not be known, so for this procedure to be useful we must have another,
independent estimate of $\sigma^2$. In situations where we have more than one observation at
some of the points $x_i$, such an estimate may be obtained as follows.

Suppose that we have $m$ distinct $x_j$-values, $1 \le j \le m$, and at each point $x_j$ we have
$n_j$ observations $y_{ij}$, $1 \le i \le n_j$, where at least one $n_j > 1$. Now for each $x_j$ with $n_j > 1$ we can obtain
an estimate of $\sigma^2$ by the standard formula

$\hat\sigma_j^2 = \frac{1}{n_j - 1} \sum_{i=1}^{n_j} (y_{ij} - \bar y_j)^2$ (3.199)

where

$\bar y_j = \frac{1}{n_j} \sum_{i=1}^{n_j} y_{ij}.$ (3.200)

At points where $n_j = 1$ we take $\hat\sigma_j^2 = 0$. Using (3.200) and pooling the estimates $\hat\sigma_j^2$
of $\sigma^2$ in the usual way we get a further estimate of $\sigma^2$


$\hat\sigma_p^2 = \frac{\sum_{j=1}^{m} (n_j - 1)\,\hat\sigma_j^2}{n - m}$ (3.201)

where $n = \sum_{j=1}^{m} n_j$ is the total number of observations. If the errors are independent
$N(0, \sigma^2)$, then $(n - m)\hat\sigma_p^2/\sigma^2$ is $\chi^2(n - m)$.
The quantity $(n - m)\hat\sigma_p^2$ is called the pure error sum of squares, whence the subscript
$p$.
Using $\hat\sigma_p^2$ we can now reject $H_0$ if


$Q = s^2/\hat\sigma_p^2$ (3.202)

is sufficiently large. To complete the test, we will need to obtain the distribution of
(3.202) and for this a slight modification of (3.202) is necessary.
If we subtract $(n - m)\hat\sigma_p^2$ from $(n - 2)s^2$ we get a new sum of squares

$SS_{LOF} = (n - 2)s^2 - (n - m)\hat\sigma_p^2$ (3.203)

called the lack of fit sum of squares, which reflects the bias term in (3.198).


Using this in (3.202) gives

$Q = \frac{\left[ SS_{LOF} + (n - m)\hat\sigma_p^2 \right] / (n - 2)}{\hat\sigma_p^2}.$ (3.204)

Rearranging (3.204) we see that we can reject $H_0$ if

$SS_{LOF}/\hat\sigma_p^2$ (3.205)

is sufficiently large.
If the errors are $N(0, \sigma^2)$ and $H_0$ is true, then $SS_{LOF}/\sigma^2$ has a $\chi^2$-distribution with $(n-2)-(n-m) = m - 2$
degrees of freedom. Therefore an equivalent test can be based on the $F$-ratio

$F = \frac{SS_{LOF}/(m - 2)}{\hat\sigma_p^2}$ (3.206)

and we reject $H_0$ at level $\alpha$ if

$F > f_{m-2,\,n-m,\,\alpha}$ (3.207)

where $P\{F > f_{m-2,\,n-m,\,\alpha}\} = \alpha$.

If (3.207) is satisfied we say that the model suffers from lack of fit, so that the simple
linear model appears to be untenable. In this case the least squares analysis discussed
in Section 3.2 should not be used, and further analysis to determine the source(s) of the
bias in the model should be performed. Often, the cause of the problem is that other
explanatory variables have been omitted, and if the variables which have been left out
can be ascertained, then further fitting and analysis via multiple regression techniques
can often deal with the problem. This will be taken up in Chapter 5.

If $H_0$ is not rejected, then the linearity assumption is tenable and further analysis of the
model, such as significance testing for $\beta_1 = 0$, is legitimate, and one may proceed as in
Sections 3.3-3.7. This assumes, of course, that all the other assumptions made there are
valid. These should be checked as well, which is usually done through residual plots.

Before giving a numerical example we summarize the procedure for using the lack of
fit test in fitting a simple linear regression model with independent $N(0, \sigma^2)$ errors.

(i) Fit the model $y = \beta_0 + \beta_1 x$ to the data using least squares and construct the
usual ANOVA table, but do not use the $F$-ratio to test for the significance of the
regression.

(ii) Calculate the pure error sum of squares ($SS_{PE}$) and subtract it from SSE to get
the lack of fit sum of squares ($SS_{LOF}$).

(iii) Form the $F$-ratio

$F = \frac{SS_{LOF}/(m - 2)}{\hat\sigma_p^2}$ (3.208)

and compare this to the tabulated $f_{m-2,\,n-m,\,\alpha}$ value (typically $\alpha = 0.05$).

(iv) If $F > f_{m-2,\,n-m,\,\alpha}$ then the model displays lack of fit and the simple linear model
is not tenable. In this case, further analysis will be necessary to determine an
appropriate model.

(v) If there is no significant lack of fit, then one can entertain the model (3.1) as plausible
and proceed to test for the significance of $\beta_1$, etc. This procedure is conveniently
carried out by expanding the ANOVA table as shown below.


Table 3.20 ANOVA Table with Lack of Fit Analysis 


Source          df     Sum of Squares   Mean Squares                 F
Regression      1      SSR              MSR = SSR/1                  MSR/MSE
Residual        n-2    SSE              MSE = SSE/(n-2)
  Lack of Fit   m-2    SS_LOF           MS_LOF = SS_LOF/(m-2)        MS_LOF/MS_PE
  Pure Error    n-m    SS_PE            MS_PE = SS_PE/(n-m)
Total           n-1    SST


Note: m = number of distinct x values 
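
The procedure in steps (i)-(v) can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not a reference implementation: the function name `lack_of_fit` and the data are made up for the example, and repeated $x$ values are grouped by exact equality.

```python
from collections import defaultdict

def lack_of_fit(x, y):
    """Lack-of-fit F ratio for a straight-line fit of y on x (repeated x needed)."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar

    # Residual sum of squares of the straight-line fit: SSE = (n - 2) s^2.
    sse = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

    # Pure error: within-group sum of squares at each distinct x value.
    groups = defaultdict(list)
    for xi, yi in zip(x, y):
        groups[xi].append(yi)
    m = len(groups)
    ss_pe = 0.0
    for ys in groups.values():
        mean = sum(ys) / len(ys)
        ss_pe += sum((v - mean) ** 2 for v in ys)

    ss_lof = sse - ss_pe                  # lack of fit sum of squares
    df_lof, df_pe = m - 2, n - m
    f_ratio = (ss_lof / df_lof) / (ss_pe / df_pe)
    return f_ratio, df_lof, df_pe, ss_lof, ss_pe

# Made-up data with strong curvature in x, two observations per x value.
x = [1, 1, 2, 2, 3, 3, 4, 4]
y = [1.1, 0.9, 4.2, 3.8, 9.1, 8.9, 16.2, 15.8]
f_ratio, df_lof, df_pe, ss_lof, ss_pe = lack_of_fit(x, y)
print(f"F = {f_ratio:.2f} on ({df_lof}, {df_pe}) degrees of freedom")
```

For these illustrative data the $F$ ratio is far above the tabulated $f_{2,4,0.05} = 6.94$, so the straight-line model would be judged to display lack of fit, as expected for data that follow a quadratic trend.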


Example 3.22 In a manufacturing process, the effect of temperature on the color
of a finished product was determined experimentally. The data collected were as follows:

Table 3.21 Color Data
Temperature (x) Color (y) | Temperature (x) Color (y)


Determine if a linear model is plausible for these data.
We begin by making a scatter plot of (x,y). This is shown in Figure 3.16.

Figure 3.16: Scatter plot for color data


From the scatter plot there appears to be an overall downward trend in the data. 
Since we have repeat points we can use the lack of fit test to test analytically if the linear 
hypothesis is tenable. 

We begin by fitting the least squares line to the data. This gives the fitted line

$\hat y = 2.536247 - 0.0047177x.$

Using this, we find that SSE = 0.098938. To calculate the $F$-ratio for lack of fit we need
to compute the pure error sum of squares. These calculations are summarized below.

Table 3.22 Calculations for $SS_{PE}$


Temperature (x)   SSE_j   df
400               0       1
410               0.02    2
420               0       2
430               0.02    1
440               0.02    1
450               0.02    1
Total             0.08    8


This information is summarized in the following ANOVA table. 


Table 3.23 ANOVA Table with Lack of Fit Analysis 


Source          df   Sum of Squares   Mean Squares   F
Regression      1    0.110395         0.110395       14.51
Residual        13   0.098938         0.007611
  Lack of Fit   5    0.018938         0.003788       0.38
  Pure Error    8    0.080000         0.010000
Total           14   0.209333


R? = 0.5273 


From Table A.4 we find that $f_{5,8,0.05} = 3.69$ and the observed lack of fit $F = 0.38 < 3.69$,
so that the hypothesis of lack of fit is rejected and we accept the linear model as tenable.

In this case, assuming that the errors are independent $N(0, \sigma^2)$, the $F$ test for the
significance of the regression is appropriate. Again from Table A.4 we find that $f_{1,13,0.05} = 4.67$
and the observed $F$ ratio 14.51 exceeds this. Thus, we can conclude that $\beta_1 \ne 0$ at
the 5% level of significance.

Keep in mind that even if lack of fit is rejected and the regression is found to be
significant, this does not mean that (3.194) is the true model. The best that we can
say is that the data appear to be consistent with this hypothesis. Further investigation
might prove otherwise.
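
The arithmetic behind the lack of fit test in Example 3.22 can be retraced from the reported sums of squares. The sketch below is only a check of that arithmetic: it takes SSE = 0.098938 on 13 degrees of freedom, the pure error total 0.08 on 8 degrees of freedom from Table 3.22, and the critical value 3.69 quoted from Table A.4; the variable names are illustrative.

```python
# Sums of squares and degrees of freedom as reported in Example 3.22.
sse, ss_pe = 0.098938, 0.08
df_res, df_pe = 13, 8

ss_lof = sse - ss_pe        # lack of fit sum of squares
df_lof = df_res - df_pe     # equals m - 2
ms_lof = ss_lof / df_lof
ms_pe = ss_pe / df_pe
f_lof = ms_lof / ms_pe

f_crit = 3.69               # tabulated f_{5,8,0.05}, as quoted in the text
print(f"SS_LOF = {ss_lof:.6f}, MS_LOF = {ms_lof:.6f}, MS_PE = {ms_pe:.6f}")
print(f"F = {f_lof:.2f}; significant lack of fit: {f_lof > f_crit}")
```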

When the lack of fit test is not appropriate, other means are needed to check
model adequacy. Of course we can use $t$ and $F$ statistics as before, but they may
be significant even if (3.1) is not the true model. Our previous analysis suggests that the
residuals $\hat\varepsilon_i$ should be useful in this respect. Intuitively, if (3.1) is the true model, then
we would expect the residuals to behave like a random sample of size $n$ from a
$N(0, \sigma^2)$ random variable. To the extent that the residuals do not appear to behave that
way, this will provide evidence of the inadequacy of (3.1). The most common way of
checking this is to make various graphical plots of the residuals and examine these plots
for apparent deviations from the model assumptions. In this section we will examine a
number of basic plots and in Chapter 6 these ideas will be extended to multiple regression
models. Before doing this, we state some additional properties of $\hat\varepsilon_i$, $1 \le i \le n$.


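Theorem 3.9 Let $\hat\varepsilon_i$, $1 \le i \le n$, be the residuals from the least squares fit of (3.1).
Then,

(i) $E(\hat\varepsilon_i) = 0$, $1 \le i \le n$;

(ii) $\mathrm{Var}(\hat\varepsilon_i) = \sigma_{\hat\varepsilon_i}^2 = \sigma^2 \left[ 1 - \frac{1}{n} - \frac{(x_i - \bar x)^2}{S_{xx}} \right]$, $1 \le i \le n$;

(iii) $\mathrm{Cov}(\hat\varepsilon_i, \hat\varepsilon_j) = -\sigma^2 \left[ \frac{1}{n} + \frac{(x_i - \bar x)(x_j - \bar x)}{S_{xx}} \right]$, $i \ne j$;

(iv) if the $\varepsilon_i$ are $N(0, \sigma^2)$, then

$\frac{\hat\varepsilon_i}{\sigma_{\hat\varepsilon_i}} \sim N(0, 1);$ (3.209)

(v) $\mathrm{Cov}(\hat\varepsilon_i, \hat Y_i) = 0$, $1 \le i \le n$.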


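Proof. (i) By definition, $\hat\varepsilon_i = Y_i - \hat Y_i$ so that $E(\hat\varepsilon_i) = E(Y_i - \hat Y_i) = E(Y_i) - E(\hat Y_i)$.
But $E(\hat Y_i) = E(\hat\beta_0 + \hat\beta_1 x_i) = E(\hat\beta_0) + x_i E(\hat\beta_1) = \beta_0 + \beta_1 x_i = E(Y_i)$, since $(\hat\beta_0, \hat\beta_1)$
are unbiased. Thus $E(\hat\varepsilon_i) = 0$, as required.

(ii) $\mathrm{Var}(\hat\varepsilon_i) = \mathrm{Var}(Y_i - \hat Y_i) = \mathrm{Var}(Y_i) + \mathrm{Var}(\hat Y_i) - 2\,\mathrm{Cov}(Y_i, \hat Y_i)$. From (3.69) and
(3.74), $\mathrm{Var}(\hat Y_i) = \sigma^2 \left[ 1/n + (x_i - \bar x)^2/S_{xx} \right]$ and $\mathrm{Cov}(Y_i, \hat Y_i) = \sigma^2 \left[ 1/n + (x_i - \bar x)^2/S_{xx} \right]$.
Thus,

$\mathrm{Var}(\hat\varepsilon_i) = \sigma^2 - \sigma^2 \left[ \frac{1}{n} + \frac{(x_i - \bar x)^2}{S_{xx}} \right] = \sigma^2 \left[ 1 - \frac{1}{n} - \frac{(x_i - \bar x)^2}{S_{xx}} \right].$ (3.210)

(iii) Now

$\mathrm{Cov}(\hat\varepsilon_i, \hat\varepsilon_j) = \mathrm{Cov}(\hat Y_i, \hat Y_j) - \mathrm{Cov}(Y_i, \hat Y_j) - \mathrm{Cov}(\hat Y_i, Y_j),$ (3.211)

since by independence $\mathrm{Cov}(Y_i, Y_j) = 0$.
But,

$\mathrm{Cov}(\hat Y_i, \hat Y_j) = \mathrm{Cov}(\hat\beta_0 + \hat\beta_1 x_i, \hat\beta_0 + \hat\beta_1 x_j) = \mathrm{Var}(\hat\beta_0) + (x_i + x_j)\,\mathrm{Cov}(\hat\beta_0, \hat\beta_1) + x_i x_j\,\mathrm{Var}(\hat\beta_1).$ (3.212)

From Theorem 3.1, $\mathrm{Var}(\hat\beta_0) = \sigma^2 \left[ 1/n + \bar x^2/S_{xx} \right]$ and $\mathrm{Var}(\hat\beta_1) = \sigma^2/S_{xx}$. Also, (3.68)
gives $\mathrm{Cov}(\hat\beta_0, \hat\beta_1) = -\sigma^2 \bar x/S_{xx}$. Thus,

$\mathrm{Cov}(\hat Y_i, \hat Y_j) = \sigma^2 \left[ \frac{1}{n} + \frac{\bar x^2}{S_{xx}} - \frac{(x_i + x_j)\bar x}{S_{xx}} + \frac{x_i x_j}{S_{xx}} \right] = \sigma^2 \left[ \frac{1}{n} + \frac{(x_i - \bar x)(x_j - \bar x)}{S_{xx}} \right].$ (3.213)

A similar calculation shows that

$\mathrm{Cov}(Y_i, \hat Y_j) + \mathrm{Cov}(\hat Y_i, Y_j) = 2\sigma^2 \left[ \frac{1}{n} + \frac{(x_i - \bar x)(x_j - \bar x)}{S_{xx}} \right],$ (3.214)

and hence, combining (3.211), (3.213) and (3.214),

$\mathrm{Cov}(\hat\varepsilon_i, \hat\varepsilon_j) = -\sigma^2 \left[ \frac{1}{n} + \frac{(x_i - \bar x)(x_j - \bar x)}{S_{xx}} \right].$ (3.215)

(iv) If $\varepsilon_i$ is $N(0, \sigma^2)$, then $\hat\varepsilon_i$ is normal as well, since $\hat\varepsilon_i$ is a linear combination of the $Y_j$'s (show
this). From (i)-(ii), $E(\hat\varepsilon_i) = 0$ and $\mathrm{Var}(\hat\varepsilon_i) = \sigma_{\hat\varepsilon_i}^2$. Hence, the normalized random
variables $Z_i = \hat\varepsilon_i/\sigma_{\hat\varepsilon_i}$ are $N(0, 1)$.

(v) $\mathrm{Cov}(\hat\varepsilon_i, \hat Y_i) = \mathrm{Cov}(Y_i - \hat Y_i, \hat Y_i) = \mathrm{Cov}(Y_i, \hat Y_i) - \mathrm{Var}(\hat Y_i)$. From (3.213) and
(3.214) with $j = i$, $\mathrm{Cov}(Y_i, \hat Y_i) = \mathrm{Var}(\hat Y_i) = \sigma^2 \left[ 1/n + (x_i - \bar x)^2/S_{xx} \right]$, so that $\mathrm{Cov}(\hat\varepsilon_i, \hat Y_i) = 0$
as required. ∎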


From (3.215) one can show that $\mathrm{Cov}(\hat\varepsilon_i, \hat\varepsilon_j) \approx 0$ for large $n$. Hence, as stated above,
if the errors are independent $N(0, \sigma^2)$, then for large $n$ the standardized residuals $\hat\varepsilon_i/\sigma_{\hat\varepsilon_i}$
should behave as a random sample from a $N(0, 1)$ random variable.

Of course, for practical purposes one needs an estimate of $\sigma^2$ in order to calculate $\sigma_{\hat\varepsilon_i}$.
There are a number of ways of doing this. The most well-known procedure is to estimate
$\sigma^2$ by $s^2$, so that $Z_i$ is estimated by

$r_i = \frac{\hat\varepsilon_i}{s \sqrt{1 - \frac{1}{n} - \frac{(x_i - \bar x)^2}{S_{xx}}}}$ (3.216)

which are usually referred to as studentized residuals (sometimes called internally
studentized residuals). Again, for large $n$ we expect that the $r_i$ should behave like a random
sample from a N(0,1) random variable.
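
As an illustration, the studentized residuals can be computed directly from the variance formula in part (ii) of Theorem 3.9 with $\sigma^2$ replaced by $s^2$. The following Python sketch uses made-up data and a hypothetical helper `studentized_residuals`; it also checks the sample counterparts of properties (i) and (v), namely that the residuals sum to zero and are orthogonal to the fitted values.

```python
import math

def studentized_residuals(x, y):
    """Return fitted values, ordinary residuals and internally studentized residuals."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    b0 = ybar - b1 * xbar
    fitted = [b0 + b1 * xi for xi in x]
    resid = [yi - fi for yi, fi in zip(y, fitted)]
    s2 = sum(e ** 2 for e in resid) / (n - 2)      # s^2 estimates sigma^2
    # Estimated Var(resid_i) = s^2 * (1 - 1/n - (x_i - xbar)^2 / Sxx)
    r = [e / math.sqrt(s2 * (1 - 1 / n - (xi - xbar) ** 2 / sxx))
         for e, xi in zip(resid, x)]
    return fitted, resid, r

# Made-up data, roughly linear in x.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
fitted, resid, r = studentized_residuals(x, y)

print("sum of residuals      :", round(sum(resid), 10))
print("sum of resid * fitted :", round(sum(e * f for e, f in zip(resid, fitted)), 10))
print("studentized residuals :", [round(v, 2) for v in r])
```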

Although the ordinary residuals, $\hat\varepsilon_i$, $1 \le i \le n$, are most commonly used for plotting,
there is another class of residuals which are of increasing importance in regression
analysis, the PRESS residuals. These residuals are obtained by fitting the model (3.1)
omitting the $i$-th data point. The resulting estimates of $\beta_0$ and $\beta_1$ are denoted by $\hat\beta_{0(-i)}$
and $\hat\beta_{1(-i)}$, and $E(Y_i)$ is estimated by

$\hat Y_{i(-i)} = \hat\beta_{0(-i)} + \hat\beta_{1(-i)} x_i.$ (3.217)

Then, the $i$-th PRESS residual is defined by

$\hat\varepsilon_{i(-i)} = Y_i - \hat Y_{i(-i)}.$ (3.218)

These can be normalized by an appropriate estimate of $\mathrm{Var}(\hat\varepsilon_{i(-i)})$, giving the
externally studentized residuals. In contrast to the $\hat\varepsilon_i$, these residuals are more amenable
to mathematical analysis and provide a non-graphical means for residual examination.
Further discussion of these will be given in Chapter 6. For now we concentrate on using
$\hat\varepsilon_i$ and $r_i$.
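
The leave-one-out definition of the PRESS residuals can be illustrated with a short Python sketch. The data and function names are made up for the example. The second function uses the standard identity that the $i$-th PRESS residual equals the ordinary residual divided by $1 - h_i$, where $h_i = 1/n + (x_i - \bar x)^2/S_{xx}$; this identity is not derived in the text above and is included here only as a cross-check.

```python
def fit_line(x, y):
    """Least squares intercept and slope for the regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

def press_residuals(x, y):
    """Refit the line n times, each time omitting the i-th data point."""
    out = []
    for i in range(len(x)):
        b0, b1 = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        out.append(y[i] - (b0 + b1 * x[i]))
    return out

def press_shortcut(x, y):
    """Same residuals via e_i / (1 - h_i), h_i = 1/n + (x_i - xbar)^2 / Sxx."""
    n = len(x)
    xbar = sum(x) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b0, b1 = fit_line(x, y)
    return [(yi - b0 - b1 * xi) / (1 - 1 / n - (xi - xbar) ** 2 / sxx)
            for xi, yi in zip(x, y)]

# Made-up data, roughly linear in x.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
loo = press_residuals(x, y)
short = press_shortcut(x, y)
print([round(v, 4) for v in loo])
print([round(v, 4) for v in short])
```

The two lists agree, so in practice the $n$ separate fits are not needed.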
3.9.2 Residual Plots


We now present a number of diagnostic residual plots which enable us to detect
violations of, or departures from, the assumptions of the simple linear regression model. These
include histograms of the residuals, normal residual plots and plots of the residuals versus
the fitted values $\hat y_i$, $1 \le i \le n$, and the independent variable $x_i$, $i = 1, 2, \ldots, n$.


Histograms 


The histogram is one of the simplest graphical summaries used to visualize the pattern
of the residuals $\hat\varepsilon_i$, $i = 1, 2, \ldots, n$. The values of $\hat\varepsilon_i$ may be arranged into a frequency
distribution (with classes of appropriate width) and then plotted in a histogram.
The horizontal axis (centered at zero) represents the range of values of the residuals and
the vertical axis indicates the frequency of each class of residuals. If the shape of the
histogram is approximately bell-shaped (mound-shaped), there
would be no reason to suspect that the normality assumption has been violated.


Normal Probability Plots 


To obtain normal probability plots we define the cumulative distribution function (cdf)
of a $N(0, 1)$ random variable by

$\Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, dt$ (3.219)

and define percentiles by

$\Phi(q) = p$ (3.220)

where $p$ is a given probability. Then $q$, the $p \times 100$th percentile, is given by

$q = \Phi^{-1}(p)$ (3.221)

where $\Phi^{-1}$ is the inverse of the cdf $\Phi$.
To make normal residual plots we order the residuals $\hat\varepsilon_i$, $1 \le i \le n$, by

$\hat\varepsilon_{(1)} \le \hat\varepsilon_{(2)} \le \cdots \le \hat\varepsilon_{(n)}$ (3.222)

where $\hat\varepsilon_{(i)}$ denote the ordered values of the residuals. Similarly we can order the
standardized residuals $r_i$, $1 \le i \le n$, as

$r_{(1)} \le r_{(2)} \le \cdots \le r_{(n)}.$ (3.223)

The ordered residuals are then plotted against $\Phi^{-1}[(i - 1/2)/n]$, $1 \le i \le n$.

Since $E[r_{(i)}] \approx \Phi^{-1}[(i - 1/2)/n]$ if the errors are normally distributed,
the points should lie approximately on a straight line (the line $y = x$ for the standardized residuals, where $y$ is the axis for $\hat\varepsilon_{(i)}$ or
$r_{(i)}$). Such plots are also useful for detecting outliers or dubious observations. If the plot
is S-shaped, it indicates that the distribution has relatively light (or short) tails. On the
other hand, a heavy (or long) tailed sampling distribution tends to look like a backward
S. Positively skewed distributions tend to have a J shape, while negatively skewed
distributions have an inverted J shape.

Hence, normal probability plots can be used to detect departures from the normality
assumptions in (3.1).
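
The coordinates of a normal probability plot are straightforward to compute. The sketch below, a minimal illustration with made-up residuals and a hypothetical helper `normal_plot_points`, pairs each ordered residual with the normal quantile at the plotting position $(i - 1/2)/n$, using `statistics.NormalDist` from the Python standard library for the inverse cdf.

```python
from statistics import NormalDist

def normal_plot_points(residuals):
    """Pair each ordered residual with the N(0,1) quantile at (i - 1/2)/n."""
    n = len(residuals)
    ordered = sorted(residuals)
    std_normal = NormalDist()          # mean 0, standard deviation 1
    quantiles = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(quantiles, ordered))

# Made-up standardized residuals.
res = [0.3, -1.2, 0.8, -0.1, 1.5]
points = normal_plot_points(res)
for q, e in points:
    print(f"quantile {q:7.3f}   ordered residual {e:6.2f}")
```

Plotting the second coordinate against the first gives the normal probability plot; for normally distributed errors the points should fall close to a straight line.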
Variable Plots


(i) Since asymptotically the studentized residuals $r_i$ do not depend on $\sigma$, a plot of $r_i$
versus $x_i$ can be used to detect deviations from the linearity assumption and/or
violations of constant variance.


(ii) From (v) of Theorem 3.9 we see that $\hat\varepsilon_i$ and $\hat Y_i$ are uncorrelated; hence if the model
(3.1) is true, a plot of $\hat\varepsilon_i$ or $r_i$ against $\hat y_i$ should exhibit random scatter within a
horizontal band as indicated in Figure 3.17(a). If we use $r_i$, then these form an
approximate random sample from a $N(0, 1)$ random variable, so it is unlikely that $|r_i| > 2$.
Hence, most of the residuals should fall between $\pm 2$. Residuals that fall outside this
band may indicate discrepant observations, called outliers, that require further
investigation (see Figure 3.17(b)). Similar statements apply to residual plots against
$x_i$. Violations of the assumption of constant variance can also be detected using
externally studentized residuals, in plots similar to those of $r_i$ against $x_i$ or $\hat y_i$. In this
case the residuals will display some systematic behavior in either $\hat y$ or $x$ such as
shown in Figure 3.17(c) or Figure 3.17(d). Such patterns often occur because the
errors are not normal and have variances which depend on the mean.
For example, a Poisson random variable $X$ with parameter $\lambda$ has $E(X) = \lambda$ and
$\sigma(X) = \sqrt{\lambda} = \sqrt{E(X)}$. This suggests that $\hat\varepsilon_i^2 \sim \hat y_i$, giving rise to the funnel pattern
shown in Figure 3.17(c). Since count data often follow a Poisson distribution, such
non-constant variance should be suspected for such data.


Figure 3.17: Typical residual plots (residuals plotted against $\hat y$): (a) null plot, (b) outliers, (c) nonconstant variance, (d) nonlinearity


(iii) In some instances residual plots can be used to detect violations of the independence
assumption. In particular, if the independent variable is time, such as in Example
3.5 (Tractor data) or economic models (stock price, GNP, unemployment rates, etc.),
then systematic patterns in the residuals may indicate serial correlation between
the observations as indicated in Figure 3.18.


(iv) I Charts: In these, residuals are plotted in the order in which the observations were taken.
These can be useful in identifying serial correlation in the data.


Figure 3.18: Residual plot if serial correlation exists


Example 3.23 (Tractor data) Using the Tractor data in Example 3.5, we illustrate
residual plots to detect any violations in the model assumptions. In Figure 3.19, although
the normal plot is roughly a straight line, the histogram suggests that the errors are not
normally distributed; it looks rather uniform. The I Chart shows a somewhat systematic
pattern.


Example 3.24 (Drink delivery data) Using the data in Table 3.5, Figure 3.20
shows variable residual plots: a histogram, a normal plot of $r_i$, and a plot of $r_i$ versus $\hat y_i$.
The plot of $r_i$ versus $x_i$ (number of cases) in Figure 3.21 looks like the funnel type in Figure
3.17(c).


Example 3.25 (Birth weight data) We use Example 3.16 to illustrate variable plots
of the residuals. Figure 3.22 shows a normal plot of $r_i$, an I Chart of the residuals, a
histogram of $r_i$, and a plot of $r_i$ versus the fitted values $\hat y_i$. The normal probability plot
looks reasonable. Also, Figure 3.23 shows the scatter plot of the standardized residuals
$r_i$ versus the variable $x$ (age).


Example 3.26 (Clark county population data) Figure 3.24 shows variable plots for
the linear fit: a normal plot of $r_i$, an I Chart of the residuals, a histogram of $r_i$,
and a plot of $r_i$ versus the fitted values $\hat y_i$. Clearly, the normal plot does not seem to be a
straight line. This indicates that the errors are not normally distributed. This can be
verified in the histogram and by the plot of residuals versus fitted values.


3.10 Transformations


In developing the theory for the simple linear model we have made a number of assumptions
concerning the relationship of $Y$ to $x$, the errors $\varepsilon_i$ and the accuracy of measurement
of the independent and dependent variables. If some of these assumptions are not true,
then the analysis outlined in Sections 3.2-3.8 may lead one to erroneous conclusions.
If these violations are detected, say through residual plots or by prior information, the
question arises as to what, if anything, can be done to model our data.

Figure 3.19: Variable plots of residuals $r_i$ for Tractor data


A general strategy for dealing with deviations from linearity, normality of errors 
and/or homoscedasticity, is to transform the variables in some fashion so that the transformed
model satisfies, at least approximately, the basic assumptions used in Sections
3.2-3.8. We shall pursue this approach here, and then it will be generalized to deal with 
multiple regression models in Chapter 6. Needless to say, trying to correct any or all of 
the possible violations of the basic assumptions is only part science, and one may need 
to rely to a great extent on judgement and knowledge of the process generating the data 
along with the formal tools of statistical analysis. 

One should keep in mind that most of the time there will not be a one-to-one cor- 
respondence between observed model inadequacies and possible solutions, so that the 
remedy finally chosen may depend on other than just mathematical considerations. In 
some situations there may be no adequate solutions at all. 

The three problems that we shall examine are: 


(i) Possible non-linearity of the mean response $E(Y_i)$ in both $x$ and $(\beta_0, \beta_1)$;

(ii) Possible non-normality of the errors;

(iii) Heteroscedasticity of the errors.


We focus on the cases (i)-(ii). Case (iii) will be discussed in Chapter 6.

Figure 3.20: Variable plots of residuals $r_i$ for drink delivery data

Figure 3.21: Residual plot: $r_i$ versus $x$ (number of cases)

Figure 3.22: Variable plots of residuals $r_i$ for birth weight data

Figure 3.23: Residual plot: $r_i$ versus $x$ (age)

Figure 3.24: Variable plots of residuals $r_i$ for Clark county population data


3.10.1 Transformations of x 


If the true model is of the form

$Y_i = \beta_0 + \beta_1 g(x_i) + \varepsilon_i$ (3.224)

where $g(x)$ is a known function of $x$, then letting $z_i = g(x_i)$, (3.224) takes the form

$Y_i = \beta_0 + \beta_1 z_i + \varepsilon_i, \quad 1 \le i \le n,$ (3.225)

and we have a linear model in $z$. In this case, if the errors are independent and $N(0, \sigma^2)$,
the analysis for (3.225) can be carried out as before. The transformation $z = g(x)$ merely
rescales the independent variable and has no effect on the linearity of the model, since in
statistical terms linearity is determined by the linearity in the parameters and not that
of the variable $x$.

The need for a possible transformation of the independent variable may be indicated
by the plot of $\hat\varepsilon_i$ (or $r_i$) versus $x$, by curvature in the plot of $\hat\varepsilon_i$ versus $\hat y_i$, and/or by theoretical
considerations, the latter being perhaps the best guide to the choice of the
transformation $g(x)$.


Example 3.27 Some of the transformations of $x$ for typical nonlinearities in $x$ are
$\sqrt{x}$, $\exp(x)$, and $\exp(-x)$.


Although theoretical considerations concerning the origin of the data can often be
helpful in selecting a transformation, in most cases where a transformation is suggested
there may not be any immediately obvious choice for $g$ in (3.224). In such a situation
several procedures have been suggested for letting the data guide the choice. For example,
Box and Tidwell [12] give a method for the determination of $\lambda$ in a model of the form

$Y = \beta_0 + \beta_1 x^{\lambda} + \varepsilon.$ (3.226)

As this procedure requires the use of multiple regression techniques we will defer our
analysis of this method to Chapter 6.


3.10.2 Transformations in x and y


If a simple transformation of the independent variable is insufficient to linearize the
model, it may be necessary to investigate transformations of both $x$ and $y$. In this regard
there are a number of standard functional relations which can be linearized using
a variety of algebraic and/or analytic manipulations. Such nonlinear models are usually
referred to as intrinsically linear.

As an example, consider the model

$Y_x = \beta_0 e^{\beta_1 x} \varepsilon_x$ (3.227)

where $\beta_0 > 0$ and $\varepsilon_x$ is a lognormal random variable (i.e., $\log \varepsilon_x$ is normal). Then taking
the logarithm of $Y_x$ we get

$\log Y_x = \log \beta_0 + \beta_1 x + \log \varepsilon_x,$ (3.228)

which is a linear model in the transformed variables $Z_x = \log Y_x$, $\beta_0' = \log \beta_0$, $\beta_1$ and
$\varepsilon_x' = \log \varepsilon_x$. In this case, if the errors $\log \varepsilon_x$ are independent and $N(0, \sigma^2)$, then $\log \beta_0$
and $\beta_1$ may be estimated by least squares and the model analyzed using the techniques
developed previously.


Example 3.28 Consider the exponential growth model, $Y_x = \exp(\beta_0 + \beta_1 x + \varepsilon_x)$,
for the Clark County population data (in Example 3.20). This model is, in fact,
intrinsically linear, hence it is linearizable. By taking the logarithm of $y$, as we have seen
in Figure 3.13, the relationship between the size of the population and time (in years)
shows almost linear growth.
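
Linearization by logarithms can be illustrated with a small Python sketch. The data here are made up and noise-free, generated from $Y = \exp(0.5 + 0.2x)$, so that regressing $\log y$ on $x$ recovers the parameters exactly; with real data such as the Clark County series the recovered values would of course only be estimates. The helper `fit_line` is a hypothetical name for an ordinary least squares fit.

```python
import math

def fit_line(x, y):
    """Least squares intercept and slope for the regression of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b1 = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    return ybar - b1 * xbar, b1

# Made-up, noise-free data from the exponential growth model Y = exp(0.5 + 0.2 x).
x = list(range(10))
y = [math.exp(0.5 + 0.2 * xi) for xi in x]

# Fit the linearized model log Y = beta0 + beta1 * x by least squares.
b0, b1 = fit_line(x, [math.log(v) for v in y])
print(f"estimated beta0 = {b0:.4f}, estimated beta1 = {b1:.4f}")
```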


One should note that in order for a model to be intrinsically linear, both the parameters
and the errors must appear linearly in the transformed model. For instance,
if

$Y_x = \beta_0 e^{\beta_1 x} + \varepsilon_x$ (3.229)

then taking the log of $Y_x$ will not linearize (3.229), since $\log(\beta_0 e^{\beta_1 x} + \varepsilon_x) \ne \log(\beta_0 e^{\beta_1 x}) + \log \varepsilon_x$.

For convenient reference we give a list of commonly occurring intrinsically linear
models and their equivalent linear forms in Table 3.24. In addition, graphs of the shapes
of these functions are given in Figure 3.25(a)-(e). These may be helpful in identifying
possible transformations to apply after examining a scatter plot of the data. Since
the graphs of different functions have similar shapes, keep in mind that more than one
transformation may give an acceptable fit over a given range of data.

Table 3.24 Intrinsic Nonlinear Forms and Transformations 


Prototype Form Type of Transformation 
Y = exp (lo + 6,2) | logarithm: Y' = log Y 


ue 


a = logg 


Y = 80 + 8,logz logarithm: x’ = log x 
BE Е reciprocal . Y'21/Y 
B Bo Br reciprocal & logarithm: 

Y 1/ (1+ ePo 1 ) Y' = log 1/Y — 1) 


Figure 


Linearized Form 


3.25(b) 
3.25(c) 


Y' = log By + Bio" 


Y* = bo + Biz 


Figure 3.25: Curves of linearizable functions: (a) exponential functions, (b) exponential functions, (c) logarithmic functions, (d) hyperbolic functions, (e) inverse exponential functions


3.10.3 Box-Cox Transformations 


Since an appropriate transformation of the data may not always be immediately apparent
from the examination of scatter plots, it is helpful to have procedures which will allow
the data to help select the form of the transformation. In this section we will discuss
a widely used procedure developed by Box and Cox [9] and generalized by a number of
others in [6, 20].

In situations where the errors have a nonnormal distribution, such as in the logarithmic
model (3.228), one generally would like to have a transformation of $Y$ which not
only linearizes the model but also transforms the errors to be approximately normal. A
family of transformations which includes the logarithm and has been found useful in
this regard is the power family

$y^{(\lambda)} = \begin{cases} (y^{\lambda} - 1)/\lambda, & \lambda \ne 0, \\ \log y, & \lambda = 0, \end{cases}$ (3.230)

where we assume that our original data $y$ are positive. (If not, then one can add a positive
constant to each observation and analyze the resulting transformed data.) One should
note that $\log y$ is the limiting value of $(y^{\lambda} - 1)/\lambda$ as $\lambda \to 0$. That is,

$\lim_{\lambda \to 0} \frac{y^{\lambda} - 1}{\lambda} = \log y.$ (3.231)

Suppose now that our data $Y_i$, $1 \le i \le n$, are such that $Y_i^{(\lambda)}$, $1 \le i \le n$, are
independent and normally distributed with common variance $\sigma^2$ and

$E\left[ Y_i^{(\lambda)} \right] = \beta_0 + \beta_1 x_i,$ (3.232a)

so that the $Y_i^{(\lambda)}$ satisfy the conditions for the simple linear regression model. Since the
value of $\lambda$ is generally unknown, our problem will be to simultaneously estimate all of
the parameters $(\lambda, \beta_0, \beta_1, \sigma)$. Although a number of procedures have been proposed for
doing this [27, 20, 6], maximum likelihood estimation appears to be the most popular.
This is the approach we will follow.

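If we let

$\mu_i = \beta_0 + \beta_1 x_i$ (3.233)

then it follows from the change of variables formula [40] that the density of $Y_i$ is given
by

$f_{Y_i}(y_i) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{ -\frac{1}{2\sigma^2} \left[ y_i^{(\lambda)} - \mu_i \right]^2 \right\} y_i^{\lambda - 1}$ (3.234)

where

$y_i^{(\lambda)} = \begin{cases} (y_i^{\lambda} - 1)/\lambda, & \lambda \ne 0, \\ \log y_i, & \lambda = 0. \end{cases}$ (3.235)

The likelihood function $L$ for the $n$ observations $(y_1, y_2, \ldots, y_n)$ is then given by

$L = \prod_{i=1}^{n} f_{Y_i}(y_i) = \left( \frac{1}{2\pi\sigma^2} \right)^{n/2} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ y_i^{(\lambda)} - \mu_i \right]^2 \right\} J(\lambda)$ (3.236)

where $J(\lambda) = \prod_{i=1}^{n} y_i^{\lambda - 1} = \left( \prod_{i=1}^{n} y_i \right)^{\lambda - 1}$. The log likelihood $\ell = \log L$ is then given by

$\ell = -\frac{n}{2} \log 2\pi - n \log \sigma - \frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ y_i^{(\lambda)} - \mu_i \right]^2 + \log J(\lambda).$ (3.237)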


To find the MLEs of $(\lambda, \beta_0, \beta_1, \sigma)$ we differentiate $\ell$ with respect to these parameters
and then solve the resulting equations obtained by setting these derivatives equal to zero.
Unfortunately, these equations do not have an explicit analytical solution, so we will have
to resort to numerical techniques for solving them.

Differentiating (3.237) with respect to $(\beta_0, \beta_1)$ it follows easily that the maximum likelihood estimates
of $(\beta_0, \beta_1)$ (denoted by $\hat\beta_0(\lambda)$, $\hat\beta_1(\lambda)$) are given by finding the least squares estimates
of $(\beta_0, \beta_1)$ using the transformed data $y^{(\lambda)}$. Thus, if $\lambda$ is known, the remaining analysis
can be carried out using the usual least squares theory. The remaining problem is to
estimate $\sigma$ and $\lambda$.

Differentiating (3.237) with respect to $\sigma$ shows that the maximum likelihood estimate
$\hat\sigma(\lambda)$ of $\sigma$ is given by

$\hat\sigma(\lambda) = \left[ \frac{1}{n} \sum_{i=1}^{n} \left( y_i^{(\lambda)} - \hat y_i^{(\lambda)} \right)^2 \right]^{1/2}$ (3.238)

where $\hat y_i^{(\lambda)} = \hat\beta_0(\lambda) + \hat\beta_1(\lambda) x_i$. Thus the value of $\ell$ maximized with respect to $(\beta_0, \beta_1, \sigma)$
is given by

$\ell_{\max} = -\frac{n}{2} \log 2\pi - n \log \hat\sigma(\lambda) + \log J(\lambda) - \frac{n}{2}.$ (3.239)

To estimate $\lambda$ we must maximize $\ell_{\max}$ with respect to $\lambda$. Since the constant $-(n/2) \log 2\pi - n/2$
is unimportant in this respect, it suffices to maximize

$\ell' = -n \log \hat\sigma(\lambda) + \log J(\lambda).$ (3.240)


As £' is a complicated function of А some type of numerical procedure is needed to do 
this. One method, suggested in [27], is to start with a plausible range of A, say [—2, 2], 
evaluate £’ at a set of values in [—2, 2], make a smooth graph from the resulting points, 
and then select the value А which maximizes £' by inspection. Draper and Smith in [27] 
state that 10-20 evenly spaced points is usually adequate. 

Having done this, one can then obtain an approximate confidence interval for $\lambda$ and
test the hypothesis $H_0 : \lambda = 1$ for the need of a transformation. If $H_0$ is rejected, then
we transform the data using $\hat\lambda$ and the MLEs of the parameters are

$$\left(\hat\beta_0(\hat\lambda),\ \hat\beta_1(\hat\lambda),\ \hat\sigma(\hat\lambda)\right) \tag{3.241}$$

This test can be performed using the approximate confidence interval for $\lambda$, which is
based on the difference between the two maximized log-likelihoods, $L_{\max}(\hat\lambda) - L_{\max}(\lambda_0)$, [27, 5].

Hence, an approximate $(1 - \alpha)100\%$ confidence interval for $\lambda$ consists of those values of
$\lambda_0$ which satisfy the inequality

$$L_{\max}(\hat\lambda) - L_{\max}(\lambda_0) \le \tfrac{1}{2}\chi^2_\alpha(1) \tag{3.242}$$

where $\chi^2_\alpha(1)$ is the upper $\alpha$ percentage point of the $\chi^2$-distribution with one degree of
freedom.

Some authors recommend rounding $\hat\lambda$ to the nearest 1/4 or 1/3 for ease of interpretation.
For example, if $\hat\lambda = 1.43$ then the value $\lambda = 1.5$ would be used in (3.240). This
procedure seems reasonable and we would recommend it.

Before presenting a numerical example of this procedure let us point out that standard
regression analysis software may not directly accommodate this method due to the
necessity of having to compute $L'$. However, a slight modification of the method enables
one to use standard regression software for estimating the parameters.
For this, observe that in (3.240)

$$L' = -n\log\hat\sigma(\lambda) + (\lambda - 1)\sum_{i=1}^{n}\log y_i = -\frac{n}{2}\log\left[\frac{\hat\sigma^2(\lambda)}{\dot y^{\,2(\lambda-1)}}\right] \tag{3.243}$$

Thus to maximize $L'$ it suffices to minimize

$$s^2(\lambda) = \frac{\hat\sigma^2(\lambda)}{\dot y^{\,2(\lambda-1)}} \tag{3.244}$$

where $\dot y = \left(\prod_{i=1}^{n} y_i\right)^{1/n}$ is the geometric mean of the observations. Now a little algebra
shows that $n s^2(\lambda)$ is the sum of squares of the residuals obtained by regressing

$$\frac{y^{(\lambda)}}{\dot y^{\,\lambda-1}} \tag{3.245}$$

on $x$. Using this observation we obtain the following algorithm for fitting (3.240).
(i) Choose a range of $\lambda$, $I = [\lambda_{\min}, \lambda_{\max}]$, and points $\lambda \in I$ at which to evaluate $s^2(\lambda)$ as before.

(ii) Regress $y^{(\lambda)}/\dot y^{\,\lambda-1}$ on $x$ and obtain $n s^2(\lambda)$, the sum of squares of the residuals,
from these regressions.

(iii) Plot the values $(\lambda, n s^2(\lambda))$ and obtain $\hat\lambda$ as before.

(iv) With the value of $\hat\lambda$ chosen above, regress $y^{(\hat\lambda)}$ on $x$ and continue the analysis on
the linearized model.
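
Steps (i) to (iii) can be sketched in a few lines of Python. This is only an illustrative sketch: it assumes the transformation has the usual Box-Cox form $y^{(\lambda)} = (y^\lambda - 1)/\lambda$ (with $\log y$ at $\lambda = 0$), the function names `scaled_boxcox`, `residual_ss` and `boxcox_grid` are hypothetical, and the data are synthetic values generated so that $\log y$ is exactly linear in $x$.

```python
import math

def scaled_boxcox(y, lam):
    """Return y^(lam) / gm^(lam - 1), where gm is the geometric mean of y.

    Assumes y^(lam) = (y^lam - 1) / lam, and log(y) when lam = 0.
    """
    gm = math.exp(sum(math.log(v) for v in y) / len(y))
    if abs(lam) < 1e-12:
        return [gm * math.log(v) for v in y]
    scale = lam * gm ** (lam - 1.0)
    return [(v ** lam - 1.0) / scale for v in y]

def residual_ss(x, z):
    """Residual sum of squares from the simple linear regression of z on x."""
    n = len(x)
    xbar, zbar = sum(x) / n, sum(z) / n
    sxx = sum((a - xbar) ** 2 for a in x)
    sxz = sum((a - xbar) * (b - zbar) for a, b in zip(x, z))
    slope = sxz / sxx
    intercept = zbar - slope * xbar
    return sum((b - intercept - slope * a) ** 2 for a, b in zip(x, z))

def boxcox_grid(x, y, lams):
    """Evaluate n*s^2(lam) on a grid and return (minimizing lam, all values)."""
    ss = [residual_ss(x, scaled_boxcox(y, lam)) for lam in lams]
    best = min(range(len(lams)), key=lambda k: ss[k])
    return lams[best], ss

# Synthetic data (not from the text): log y is exactly linear in x,
# so the grid search should pick a power at or near zero.
x = [float(k) for k in range(1, 11)]
y = [math.exp(1.0 + 0.3 * v) for v in x]
lams = [round(-1.0 + 0.1 * k, 1) for k in range(21)]
lam_hat, ss = boxcox_grid(x, y, lams)
print(lam_hat)
```

In practice one would plot `ss` against `lams`, as in step (iii), rather than rely on the grid minimum alone.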


Example 3.29 (Clark county population data) Recall the Clark county population
data in Example 3.20. We selected values of $\lambda$ ranging from $-1.0$ to $1.0$, and for each
chosen $\lambda$ the transformation (3.230) was made and the linear regression of $y^{(\lambda)}$ on $x$ was
obtained.

Note from Table 3.25 that the Box-Cox procedure identifies a power near $\lambda = 0$ as
being the optimum value.


Table 3.25 Values of SSE for Selected Values of $\lambda$


34, 951, 428, 737 | 0. 5, 144, 162 
10, 208,672,414 | 0. 37, 404, 370 
2,505, 000,472 | 0. 461, 173, 312 
437,511,496 | 0. 2, 625, 347, 048 
35,247,919 | 0.8 | 10,652, 489,336 
4,898,279 | 1.0 | 36,308,540, 498 


The approximate 95% confidence interval for $\lambda$ is specified on the graph ($-0.577 <
\lambda < 0.481$) in Figure 3.26. We see that the validity of using $\lambda = 0$ (i.e., $\log Y$) is confirmed
by this graph and that the transformation is well estimated.

Figure 3.26: An Approximate 95% Confidence Interval for $\lambda$


Application of the natural logarithm transformation to the original data gives us the 
transformed data in Table 3.26. 


Table 3.26 Transformed Values $y^{(0)} = \log Y$


Obs. No.   Year (x)   Population (Y)   log Y
1          1920       4,859            8.4886
2          1930       8,539            9.0524
3          1940       16,414           9.7059
4          1950       48,589           10.7912
5          1960       127,016          11.7521
6          1970       273,288          12.5183
7          1980       463,087          13.0457


3.11 Exercises 


3.1 Suppose we have the simple linear regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, 2, \ldots, n$.
Define the residuals $\hat e_i = y_i - \hat y_i$, $i = 1, 2, \ldots, n$. Show that

(a) $\sum_{i=1}^{n} \hat e_i = 0$.

(b) $\sum_{i=1}^{n} y_i = \sum_{i=1}^{n} \hat y_i$.

(c) $\sum_{i=1}^{n} x_i \hat e_i = 0$.

(d) $\sum_{i=1}^{n} \hat y_i \hat e_i = 0$.

3.2 A researcher considered a simple linear model for analyzing a certain data set with
$n = 26$ pairs of observations. From calculations, he obtained estimates: $\hat\beta_0 = 8.0$,
$\hat\beta_1 = 2.2$ and $\hat\sigma^2 = 9$.

(a) If $x = 7$, what is the value of $\hat Y$?

(b) Find the (approximate) distribution of $Y_i$, specifying $E(Y_i)$ and $\mathrm{Var}(Y_i)$.

(c) Suppose that an observation on $Y$ is made for $x = 5$. Find the probability that
$Y$ falls between 16 and 22.

3.3 Suppose $\hat\beta_0$ and $\hat\beta_1$ are the least squares estimators of the simple linear regression
model parameters. Find $\mathrm{Cov}(\hat\beta_0, \hat\beta_1)$.

3.4 Verify the expression in (3.29). [Hint: Use the Pythagorean Theorem.]

3.5 Find the maximum likelihood estimator of $\sigma^2$ for the error model in (3.4).


3.6 Given the following partial calculations which were made from a sample of size
$n = 42$:
$\bar x = 6.1$, $S_{xx} = 9.74$, $\hat\beta_0 = 3.45$, $\hat\beta_1 = 1.81$, $\hat\sigma^2 = 43$.


(a) Construct 90% confidence intervals for $\beta_0$, for $\beta_1$ and for $\sigma^2$.
(b) Test $H_0 : \beta_1 = 0$ at $\alpha = 5\%$.
(c) Obtain a 90% confidence region for $\beta_0$ and $\beta_1$.


3.7 For the simple linear regression model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $i = 1, 2, \ldots, n$, the sum
of squares due to regression is defined by

$$\mathrm{SSR} = \sum_{i=1}^{n} (\hat y_i - \bar y)^2.$$

(a) Show that $\hat y_i - \bar y = \hat\beta_1 (x_i - \bar x)$.
(b) Use (a) to show that $\sum_{i=1}^{n} (\hat y_i - \bar y)^2 = \hat\beta_1^2 S_{xx}$.
(c) Use (b) to establish the relationship


(d) Show that $E(\mathrm{MSR}) = \sigma^2 + \beta_1^2 S_{xx}$.


3.8 A group of researchers studied the effect of the molar ratio of sebacic acid on the 
intrinsic viscosity of copolyesters. The following table provides the data. 


Molar Ratio (x) Viscosity (y)


0.3 0.44 
0.4 0.55 
0.5 0.57 
0.6 0.70 
0.7 0.58 
0.8 0.34 
0.9 0.20 
1.0 0.45 


(a) Draw a scatterplot for these data.

(b) Fit the simple linear regression line: $\hat y = \hat\beta_0 + \hat\beta_1 x$.

(c) Calculate the residuals $\hat e_i$ and plot them.

(d) Is the model considered in (b) significant? Use $\alpha = 0.05$.

3.9 Observations on the yield of a chemical reaction taken at various levels of temperature
were recorded as follows [90].


Temperature x (°C)   Yield y (%)

150                  77.4  76.7  78.2
200                  84.1  84.5  83.7
250                  88.9  89.2  89.7
300                  94.8  95.9  94.7


Suppose that a simple linear regression model was postulated.
(a) Draw a scatterplot for these data.
(b) Find the least squares estimators $\hat\beta_0$ and $\hat\beta_1$.
(c) Estimate the mean square error (MSE) $\hat\sigma^2$.
(d) Find the standard error of $\hat\beta_1$, $\hat\sigma(\hat\beta_1)$, and perform the t-test for testing $H_0 :
\beta_1 = 0$ versus $H_1 : \beta_1 > 0$. Take $\alpha = 5\%$.
(e) Construct a 95% confidence interval for $\beta_1$.
(f) Construct a 90% prediction interval for the mean response at $x = 275$°C.
3.10 Consider the data given in Exercise 3.9.
(a) Construct the ANOVA table with lack of fit and pure error terms.
(b) Is the regression model significant in (a)? Take $\alpha = 0.01$.
(c) Show numerically that $F = t^2$, where $t$ was obtained in Exercise 3.9(d) for testing $\beta_1 = 0$.
(d) Test the lack of fit of the model. Use $\alpha = 0.05$.
(e) Compute $R^2$ and give an interpretation.
3.11 Consider a simple linear regression model with $\beta_0 = 0$.
(a) Verify that the least-squares estimator of $\beta_1$, $\hat\beta_1$, is given by

$$\hat\beta_1 = \frac{\sum_{i=1}^{n} x_i y_i}{\sum_{i=1}^{n} x_i^2}.$$

(b) Show that $\hat\beta_1$ is unbiased and that $\mathrm{Var}(\hat\beta_1) = \sigma^2 / \sum_{i=1}^{n} x_i^2$.

(c) Show that $s^2$ in (3.164) is unbiased for $\sigma^2$. How many degrees of freedom does
it have?

(d) Show that $\mathrm{SSR} = \sum_{i=1}^{n} (\hat\beta_1 x_i)^2$.


3.12 The data below for the years 1919 to 1935 ($n = 17$) give $x$ = the water content
of snow on April 1 and $Y$ = the water yield from April to July (in inches) in the
Snake River watershed in Wyoming [121].


Year   x      Y

1919   23.1   10.5
1920   32.8   16.7
1921   31.8   18.2
1922   32.0   17.0
1923   30.4   16.3
1924   24.0   10.5
1925   39.5   28.1
1926   24.3   12.4
1927   52.5   24.9
1928   37.9   22.8
1929   30.5   14.1
1930   25.1   12.9
1931   12.4   8.8
1932   35.1   17.4
1933   31.5   14.9
1934   21.1   10.5
1935   27.6   16.1


(a) Fit a regression through the origin with the model $Y_i = \beta_1 x_i + \varepsilon_i$, $i = 1, 2, \ldots, n$.

(b) Estimate $\sigma^2$.

(c) Construct a 95% confidence interval for $\beta_1$.

(d) Obtain the ANOVA table and compute $F$.

3.13 After the values of $Y$ were regressed on $x$, the following numbers were reported:


$\hat\beta_0 = 12.5$, $\hat\beta_1 = -5.73$, $\bar y = 28.8$.


(a) Complete the ANOVA table: 


Source of Variation df Sum of Squares Mean Squares 


Regression 157.94 
Residual 19 
Total (Corrected) 188.25 


(b) Calculate the proportion of the $Y$-variability that is explained by the fitted
regression line.

(c) Find the sample correlation coefficient between $x$ and $y$.
(d) Obtain a 95% confidence interval for $\beta_1$. [Hint: $\hat\beta_1^2 = \mathrm{SSR} / \sum (x_i - \bar x)^2$.]
(e) Test the hypothesis $H_0 : \beta_1 = 10$ against $H_1 : \beta_1 > 10$. Take $\alpha = 5\%$.


3.14 Consider the simple linear regression model in (3.4). Suppose we replace each $x_i$
with $cx_i$, where $c$ is a nonzero constant. That is, we have the model

$$Y_i = \beta_0 + \beta_1 (c x_i) + \varepsilon_i, \quad i = 1, 2, \ldots, n.$$

(a) Derive the least squares estimators of $\beta_0$ and $\beta_1$.
(b) Estimate $\sigma^2$.
(c) Are these estimators in (a) the same as (3.18)-(3.19)?
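3.16 Show that the estimator (3.19), $\hat\beta_0 = \bar y - \hat\beta_1 \bar x$, can be expressed as $\hat\beta_0 = \sum_{i=1}^{n} c_i y_i$,
where

$$c_i = \frac{1}{n} - \frac{(x_i - \bar x)\,\bar x}{S_{xx}}.$$


3.17 Consider a simple linear regression model with the assumption that the $\varepsilon_i$'s are i.i.d.
$N(0, \sigma^2)$. Suppose we reparameterize the model as

$$Y_i = \gamma_0 + \gamma_1 (x_i - \bar x) + \varepsilon_i, \quad i = 1, 2, \ldots, n.$$

Let $\hat\beta_0$ and $\hat\beta_1$ denote the MLEs of $\beta_0$ and $\beta_1$ respectively and $\hat\gamma_0$ and $\hat\gamma_1$ denote
the MLEs of $\gamma_0$ and $\gamma_1$ respectively.


(a) Show that $\hat\gamma_1 = \hat\beta_1$.
(b) Show that $\hat\gamma_0 \neq \hat\beta_0$. In fact, show that $\hat\gamma_0 = \bar y$.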
(c) Find the distribution of 4. 


3.18 Prove that two expressions for the sum of squares due to regression are equal. 
That is, show that 


КЕ Do c7 9 (c Е 
, y) УЙ (тот c 


n 


1 


3.19 After observations $(x_i, y_i)$ were obtained, we postulated the regression model to
describe the relationship by

$$Y_i = \theta_1 x_i^2 + \varepsilon_i, \quad i = 1, 2, \ldots, n$$

where the $\varepsilon_i$'s are assumed to be i.i.d. $N(0, \sigma^2)$. This is a quadratic model passing
through the origin.
(a) Find the least squares estimator of $\theta_1$.
(b) Find the maximum likelihood estimator of $\theta_1$.
3.20 The minimum absolute deviation line is given by the values of $\beta_0$ and $\beta_1$ that
minimize

$$\sum_{i=1}^{n} \left| y_i - (\beta_0 + \beta_1 x_i) \right|.$$

(a) Show that, for a data set with three observations, $(x_1, y_1)$, $(x_1, y_2)$, and $(x_3, y_3)$,
any line that goes through $(x_3, y_3)$ and lies between $(x_1, y_1)$ and $(x_1, y_2)$ is a minimum
absolute deviation line.


Suppose we now have three pairs of measurements that are taken on heart rate ($x$,
in beats per minute) and oxygen consumption ($y$, in ml/kg): (127, 14.4), (127, 11.9),
and (136, 17.9).


(b) Find the slope and intercept of the least squares line.


(c) Find the minimum absolute deviation line.
3.21 Given the data of pairs $(x, y)$ tabulated below, and assuming normal, independent
observations with constant variance $\sigma^2$:


—24 53 -06 84 04 -10 8&7 -46 
—0.2 22 -—01 28 1.5 21 2.9 —0.5 


(a) Find the maximum likelihood estimates of $\beta_0$, $\beta_1$, and $\sigma^2$.
(b) Do the MLEs of $\beta_0$ and $\beta_1$ in (a) agree with the LSEs of $\beta_0$ and $\beta_1$?
(c) Construct a 90% confidence interval for the mean response of $Y$ when $x = 0.5$.
(d) Make a normal probability plot for the residuals.
3.22 Consider the model: $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the $\varepsilon_i$ are independent $N(0, k_i^{-1}\sigma^2)$ and
the $k_i$'s are known positive constants, $i = 1, 2, \ldots, n$.
(a) Write the log-likelihood function, $L(\beta_0, \beta_1, \sigma^2)$.
(b) Find the MLEs of $\beta_0$ and $\beta_1$.
(c) Derive the normal equations and find the LSEs of $\beta_0$ and $\beta_1$.


Chapter 4 


Random Vectors and Matrix 
Algebra 


4.1 Introduction 


In dealing with the simple linear regression model we were able to calculate all of the 
necessary quantities without using any special algebraic techniques. However, to ex- 
tend regression models to account for more than one independent variable it becomes 
convenient to use the more sophisticated approach afforded by matrix algebra. 

As a consequence, this chapter will be devoted to developing some of the basic con- 
cepts in this area and to proving a number of important theorems which the reader may 
not have encountered in elementary courses in matrix algebra. In addition, some proba- 
bilistic and statistical consequences of these ideas will also be examined, particularly the 
joint multivariate normal distribution. Because our statistical work in future chapters 
will be restricted almost exclusively to full rank regression models (see Chapter 5), some 
specialized topics such as the theory of generalized inverses will be omitted. Further 
algebraic results of this and a more advanced nature can be found in [112] and we refer 
the reader there for more details. 


4.2 Matrices and Vectors 


A matrix is a rectangular array of numbers as depicted in Figure 4.1. The horizontal
sequences are called the rows of the matrix, while the vertical sequences are called its 
columns. The rows are labeled in increasing order from top to bottom, while the columns 
are labeled from left to right. 


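$$\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1m} \\
a_{21} & a_{22} & \cdots & a_{2m} \\
\vdots & \vdots & a_{ij} & \vdots \\
a_{n1} & a_{n2} & \cdots & a_{nm}
\end{bmatrix}$$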

Figure 4.1: n x m Matrix 
A matrix with $n$ rows and $m$ columns, $n \ge 1$, $m \ge 1$, will be called an $n \times m$ matrix
(“n by m” matrix) and the pair of integers $(n, m)$ its dimension. The individual numbers
that make up the matrix are usually referred to as its elements and the number in the
$i$-th row and $j$-th column will be called the $ij$-th element. In general, a matrix will be
denoted by an upper case boldface letter A, B, X etc., and the corresponding elements by
lower case letters $a_{ij}$, $b_{ij}$, etc. If $n = m$, the matrix is square, otherwise it is rectangular.

If A is an $n \times n$ square matrix, then the elements $a_{ii}$, $1 \le i \le n$, are called the
diagonal elements of A, while if $i \neq j$, the $a_{ij}$ are called the off-diagonal elements.


4.2.1 Some Special Matrices 


In using matrix algebra to analyze regression models a number of special matrices occur
repeatedly, so it is convenient to have special notation reserved for these circumstances.
If the matrix A has all its elements zero, then it will be called a zero matrix and denoted
by 0. If the dimension is not clear from the context, then $0_{n \times m}$ will denote the $n \times m$
zero matrix.

A square matrix A whose off-diagonal elements are all zero will be called a diagonal
matrix and often written as $\mathrm{diag}(a_i)$ where $a_i$, $1 \le i \le n$, are the diagonal elements. If
$a_i = 1$, $1 \le i \le n$, then A is the $n \times n$ identity matrix, written as $I_n$ or just $I$ if the
dimension is understood. When all the elements of a square matrix above the diagonal
are zero, then the matrix is said to be lower triangular, while if all the elements below
the diagonal are zero, then the matrix is said to be upper triangular.

Given a matrix A we can form a new matrix $A^T$ called the transpose of A by interchanging
the rows and columns of A; that is, the $i$-th column of $A^T$ is the $i$-th row of A.
For example, if 


к о кш о ҥн H 
m Whe 


then 


i ЛД se a 
Ew T. 


If A is $n \times m$, then $A^T$ is $m \times n$, and $A^T$ has the important property that if you
transpose it, then you arrive at the original matrix A. Symbolically, $(A^T)^T = A$. If A
is square and $A = A^T$, then we say that A is symmetric, otherwise it is non-symmetric.
For example,

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \\ 3 & 4 & 5 \end{bmatrix}$$

is symmetric, while

$$A = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix}$$

is not.
As we shall see, symmetric matrices play an important role in regression analysis, 
and a number of their properties will be studied further in Section 4.6. 
4.3 Fundamentals of Matrix Algebra


As for real numbers, we can introduce various operations on matrices, such as addition, 
subtraction and multiplication. This topic is usually referred to as matrix algebra and 
we will take up some basic aspects in this section. Further details will be developed in 
later sections. 


4.3.1 Matrix Addition 


If A and B are two $n \times m$ matrices, they can be added together to give a new matrix C
defined by the formula $C = [c_{ij}]$, $1 \le i \le n$, $1 \le j \le m$, where

$$c_{ij} = a_{ij} + b_{ij}, \quad 1 \le i \le n,\ 1 \le j \le m. \tag{4.1}$$

C is called the sum of A and B and is written as

$$C = A + B. \tag{4.2}$$

Note that by definition only matrices of the same dimension can be added and that
the sum of two $n \times m$ matrices is again an $n \times m$ matrix, in which case they are said to
be conformable for addition. Matrix addition has many of the properties of addition of
real numbers, and the corresponding names are the same. A number of these are stated
below and their proofs are left as simple exercises.


4.3.2 Properties of Matrix Addition 

If A, B, C, etc. are $n \times m$ matrices, then the following properties hold:
(i) $A + B = B + A$ (commutativity)
(ii) $(A + B) + C = A + (B + C)$ (associativity)


(iii) From (i) and (ii) it follows that a sequence of matrices can be added without
regard to order and with no need for parentheses. For example, we have for any
$n \times m$ matrices A, B, C, D that

$$(A + B) + (C + D) = A + B + C + D, \text{ etc.} \tag{4.3}$$

More generally, the sum of the $n \times m$ matrices $A_i$, $1 \le i \le p$, is defined in any order
and this allows us to use summation notation unambiguously. That is,

$$A_1 + A_2 + \cdots + A_p = \sum_{i=1}^{p} A_i. \tag{4.4}$$

(iv) If A is $n \times m$ and 0 is the $n \times m$ zero matrix, then we have

$$A + 0 = 0 + A = A. \tag{4.5}$$

(v) For any matrix A we can define its negative $-A$, given by $-A = [-a_{ij}]$, and from
this it follows easily that

$$A + (-A) = -A + A = 0. \tag{4.6}$$

If we introduce the concept of matrix subtraction given by

$$A - B = A + (-B) \tag{4.7}$$

where A and B are $n \times m$, then (4.6) can be written more conveniently as

$$A - A = 0 \tag{4.8}$$

in analogy with the corresponding property of real numbers.


4.3.3 Matrix Multiplication 


There are a number of ways of defining matrix multiplication with pointwise multipli- 
cation of matrices perhaps being the most natural. Unfortunately, that definition is not 
the one that has been found to be most useful in practice and so a different, somewhat 
more complicated method, is most often used. 

The definition of matrix multiplication is perhaps most easily motivated by trying to 
find a compact notation to express a system of n linear equations in n unknowns. For 
instance, consider the system of equations 


22+ Зу = 5 
l or + бу = 7 (4.9) 
and let А be the coefficient matrix of (4.9). That is 
2 3 
a=[2 1]. a 


and let $x = (x, y)^T$ be the column vector of unknowns. Then we can define the product
of A and x so that the result is the left-hand side of (4.9). That is,


Ax- ete =|: eal (4.11) 


From (4.11) we see that in order to get the first row of Ax we multiply the first 
element in the first row of A by the first element of x, then we multiply the second 
element in the first row of A by the second element of x and the results are added. The 
second row of Ax is produced in a similar fashion. 

More generally, if A is $n \times m$ and B is $m \times p$ then we can define the product of A
and B, denoted by AB, by (the ‘bars’ separate columns of B)

$$AB = A\,[\,b_1 \,|\, b_2 \,|\, \cdots \,|\, b_p\,] = [\,Ab_1 \,|\, Ab_2 \,|\, \cdots \,|\, Ab_p\,] \tag{4.12}$$

where $b_j$, $1 \le j \le p$, is the $j$-th column of B, and $Ab_j$ is the product of A with the
column vector $b_j$. More explicitly, if $b_j = (b_{1j}, b_{2j}, \ldots, b_{mj})^T$ then the $i$-th element $(Ab_j)_i$
of $Ab_j$ is given by

$$(Ab_j)_i = \sum_{k=1}^{m} a_{ik} b_{kj}. \tag{4.13}$$

From (4.12)-(4.13) we see that if $C = AB$ is the product of A and B, then

$$c_{ij} = \sum_{k=1}^{m} a_{ik} b_{kj}, \quad 1 \le i \le n,\ 1 \le j \le p. \tag{4.14}$$

In particular, it is important to note that AB is defined if and only if the number of
columns of A equals the number of rows of B. Thus if A is $n \times m$ and B is $m \times p$, then
AB is well defined and AB is $n \times p$. In this case we say that A and B are conformable
matrices. Thus, in general, if AB is defined then BA is not defined unless $p = n$. In
particular, if A and B are both square $n \times n$ matrices, then both AB and BA are
defined but in general are unequal. If it is true that $AB = BA$ we then say that A and
B commute.


4.3.4 Properties of Matrix Multiplication 


As for matrix addition, matrix multiplication satisfies a number of useful properties which 
the student should commit to memory. The details are left as exercises. 


(i) In general, $AB \neq BA$.


(ii) If, however, A is $n \times m$, B is $m \times p$ and C is $p \times q$ then

$$A(BC) = (AB)C = ABC. \tag{4.15}$$

Equation (4.15) is called the associative property of matrix multiplication and is
true for any sequence of $p$ matrices $(A_i)_{i=1}^{p}$ whose product $A_1 A_2 \cdots A_p$ is well
defined.


(iii) If $I_n$ is the $n \times n$ identity matrix and A is $n \times m$, then

$$I_n A = A \tag{4.16}$$

and similarly,

$$A I_m = A. \tag{4.17}$$

In particular, if A is $n \times n$ then

$$A I_n = I_n A = A. \tag{4.18}$$

(iv) If $0_{n \times m}$ is the $n \times m$ zero matrix and A is $m \times p$ then

$$0_{n \times m} A = 0_{n \times p} \tag{4.19}$$

while

$$A\, 0_{p \times q} = 0_{m \times q}. \tag{4.20}$$

(v) When AB is defined, then

$$(AB)^T = B^T A^T. \tag{4.21}$$

(vi) If A is $n \times m$ and B and C are $m \times p$, then

$$A(B + C) = AB + AC. \tag{4.22}$$

Equation (4.22) is called the distributive law for matrix multiplication.
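
The definition (4.14) and properties (i) and (v) can be checked numerically. The following is a minimal sketch using plain Python lists; the helper names `matmul` and `transpose` and the two matrices are hypothetical and chosen only for illustration.

```python
def matmul(A, B):
    """Return C = AB, where c_ij is the sum over k of a_ik * b_kj, as in (4.14)."""
    n, m, p = len(A), len(B), len(B[0])
    if any(len(row) != m for row in A):
        raise ValueError("AB is defined only if columns of A equal rows of B")
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def transpose(A):
    """Interchange the rows and columns of A."""
    return [list(col) for col in zip(*A)]

# Hypothetical 2 x 2 matrices, not taken from the text.
A = [[2, 3], [1, 5]]
B = [[0, 1], [1, 0]]

AB = matmul(A, B)
BA = matmul(B, A)
print(AB)  # [[3, 2], [5, 1]]
print(BA)  # [[1, 5], [2, 3]], so AB and BA differ here
print(transpose(AB) == matmul(transpose(B), transpose(A)))  # True, property (v)
```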
4.3.5 Scalar Multiplication


In addition to being able to multiply two matrices together it is also possible to multiply
a matrix by a real number. This form of multiplication is usually called scalar multiplication
and is defined as follows. If A is an $n \times m$ matrix and $c$ is a real number, then
the scalar multiple of A by $c$ is the matrix $cA$ whose $ij$-th element is just $ca_{ij}$. Thus
$cA$ is obtained by multiplying every element of A by the real number $c$. This type of
multiplication also has a number of useful properties which are stated below.


(i) $c(A + B) = cA + cB$;

(ii) $(c + d)A = cA + dA$;

(iii) $(cd)A = c(dA)$;

(iv) $0A = 0$;

(v) $1A = A$ and $(-1)A = -A$;

(vi) $(cA)^T = cA^T$;


Again (i)-(vi) are easily proved and are left as exercises. 


4.3.6 Powers of Matrices 


As for real numbers, it follows from the associative property of matrix multiplication 
that one can define powers of a square matrix A. 
If A is an $n \times n$ matrix, then we can define $A^1 = A$ and

$$A^2 = A \times A^1 = A \times A. \tag{4.23}$$

Continuing this way

$$A^3 = A^2 \times A = A \times A^2 \tag{4.24}$$

and by induction

$$A^n = A^{n-1} \times A = A \times A^{n-1}. \tag{4.25}$$

$A^2$ is read “A squared”, $A^3$ is “A cubed” and $A^n$ is “A to the n-th power.” For some
purposes it is useful to let

$$A^0 = I. \tag{4.26}$$

From (4.25) and the associativity of matrix multiplication it follows that

$$A^n A^m = A^n \times A^m = A^{n+m}, \quad n \ge 0,\ m \ge 0. \tag{4.27}$$

An important class of matrices in statistics (and elsewhere) are those that satisfy

$$A^2 = A. \tag{4.28}$$

When (4.28) holds we say that A is an idempotent or projection matrix. In particular,
$I_n$ is idempotent, as are diagonal $n \times n$ matrices of the form

$$\begin{bmatrix}
1 & & & & & \\
& \ddots & & & & \\
& & 1 & & & \\
& & & 0 & & \\
& & & & \ddots & \\
& & & & & 0
\end{bmatrix}_{n \times n} \tag{4.29}$$

with $p$ ones and $n - p$ zeros on the diagonal and zeros elsewhere.

In addition, if A is symmetric we will say that A is an orthogonal projection. It will be 
shown in Section 4.6 that in an appropriate sense all orthogonal projection matrices are 
of the form in (4.29). This fact plays an important role in understanding the properties 
of linear regression models with normal errors. 
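
As a quick numerical illustration of (4.28), the sketch below checks idempotency and symmetry for three small matrices. The helper names and the matrices `D`, `P` and `N` are hypothetical examples, not taken from the text; exact fractions are used so that the equality tests are not affected by rounding.

```python
from fractions import Fraction

def matmul(A, B):
    """Matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def is_idempotent(A):
    """True when A * A equals A, i.e. (4.28) holds."""
    return matmul(A, A) == A

def is_symmetric(A):
    """True when A equals its transpose."""
    n = len(A)
    return all(A[i][j] == A[j][i] for i in range(n) for j in range(n))

half = Fraction(1, 2)
D = [[1, 0, 0], [0, 1, 0], [0, 0, 0]]   # diagonal form (4.29) with n = 3, p = 2
P = [[half, half], [half, half]]        # symmetric and idempotent
N = [[1, 1], [0, 0]]                    # idempotent but not symmetric

for name, M in (("D", D), ("P", P), ("N", N)):
    print(name, is_idempotent(M), is_symmetric(M))
```

In the terminology above, `D` and `P` are orthogonal projections, while `N` is a projection that is not orthogonal.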


4.3.7 Matrix Trace 


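If A is an $n \times n$ matrix, then the trace of A, $\mathrm{tr}(A)$, is the sum of its diagonal elements;
i.e.

$$\mathrm{tr}(A) = \sum_{i=1}^{n} a_{ii}. \tag{4.30}$$

$\mathrm{tr}(A)$ plays an important role in many regression calculations and has a number of
important properties.

(i) If $A = 0$, then $\mathrm{tr}(A) = 0$.
(ii) If $A = I_{n \times n}$, then $\mathrm{tr}(A) = n$.
(iii) If $c \in \mathbb{R}$, then $\mathrm{tr}(cA) = c\,\mathrm{tr}(A)$.
(iv) If A and B are $n \times n$ matrices, then

$$\mathrm{tr}(A + B) = \mathrm{tr}(A) + \mathrm{tr}(B). \tag{4.31}$$

(v) If A, B, C are $n \times n$ matrices, then

$$\mathrm{tr}(AB) = \mathrm{tr}(BA) \tag{4.32}$$

and

$$\mathrm{tr}(ABC) = \mathrm{tr}(CAB) = \mathrm{tr}(BCA). \tag{4.33}$$

(vi) $\mathrm{tr}(A^T) = \mathrm{tr}(A)$.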

The proofs of (i) to (vi) are left as exercises. 
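
Properties (iv) and (v) are easy to confirm on a small example. The sketch below uses three hypothetical $2 \times 2$ integer matrices; the helper names are illustrative only.

```python
def matmul(A, B):
    """Matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def trace(A):
    """Sum of the diagonal elements of a square matrix, as in (4.30)."""
    return sum(A[i][i] for i in range(len(A)))

# Hypothetical matrices, not taken from the text.
A = [[1, 2], [3, 4]]
B = [[0, 1], [5, 2]]
C = [[2, 0], [1, 3]]

A_plus_B = [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]
print(trace(A_plus_B) == trace(A) + trace(B))              # property (iv)
print(trace(matmul(A, B)) == trace(matmul(B, A)))          # (4.32)
abc = trace(matmul(matmul(A, B), C))
cab = trace(matmul(matmul(C, A), B))
bca = trace(matmul(matmul(B, C), A))
print(abc == cab == bca)                                   # (4.33)
```

Note that (4.33) covers cyclic rearrangements only; an arbitrary reordering such as $\mathrm{tr}(ACB)$ need not give the same value.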


4.4 Matrices and Linear Transformations 


Another way of regarding a matrix A is to see how it acts on vectors by matrix multiplication.
If x is a column vector of dimension $p$ and A is an $n \times p$ matrix, then Ax is a
column vector. From the properties of matrix multiplication it can easily be shown that

$$A(x + y) = Ax + Ay, \quad x, y \in \mathbb{R}^p \tag{4.34}$$

and

$$A(cx) = cAx, \quad c \in \mathbb{R},\ x \in \mathbb{R}^p. \tag{4.35}$$

Using this observation we can regard an $n \times p$ matrix as defining a mapping from $\mathbb{R}^p$ to
$\mathbb{R}^n$ by the formula

$$A : \mathbb{R}^p \to \mathbb{R}^n, \quad x \mapsto Ax \quad (\mapsto \text{ stands for ‘replaced by’}). \tag{4.36}$$

Properties (4.34) and (4.35) are usually called the linearity properties of A and in this
context A is referred to as a linear transformation or linear operator. Conversely, it is
easily shown that any linear operator $T : \mathbb{R}^p \to \mathbb{R}^n$ can be represented by a matrix A.

To see this, let $e_i = (0, \ldots, 1, \ldots, 0)^T$ denote a column vector with one in the $i$-th
position and zeros elsewhere. Then any column vector x of dimension $p$ can be written
as

$$x = \sum_{i=1}^{p} x_i e_i. \tag{4.37}$$

Hence, if $T$ satisfies (4.34)-(4.35), using these repeatedly gives

$$Tx = \sum_{i=1}^{p} x_i\, T e_i. \tag{4.38}$$

Now, $Te_i \in \mathbb{R}^n$ so it is some column vector $a_i = (a_{1i}, a_{2i}, \ldots, a_{ni})^T$, $1 \le i \le p$. Defining
A as the matrix whose $i$-th column is $a_i$ it follows easily that

$$Tx = Ax. \tag{4.39}$$


Hence, we can think of a matrix and its corresponding linear transformation inter- 
changeably. Often the geometric language is more convenient. 
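
The construction in (4.37)-(4.39) translates directly into code: apply the map to each unit vector and use the results as columns. The sketch below is illustrative only; the linear map `T` and the helper names `matrix_of` and `apply` are hypothetical.

```python
def matrix_of(T, p):
    """Build the matrix whose i-th column is T(e_i), for a linear map T on R^p."""
    cols = []
    for i in range(p):
        e = [1 if k == i else 0 for k in range(p)]  # unit vector e_i
        cols.append(T(e))
    n = len(cols[0])
    return [[cols[j][i] for j in range(p)] for i in range(n)]

def apply(A, x):
    """Matrix-vector product Ax."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def T(v):
    """A hypothetical linear map from R^3 to R^2."""
    x, y, z = v
    return [x + 2 * y, 3 * z - y]

A = matrix_of(T, 3)
print(A)                       # [[1, 2, 0], [0, -1, 3]]
print(apply(A, [4, 5, 6]))     # [14, 13]
print(T([4, 5, 6]))            # [14, 13], the same vector as Ax
```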


4.4.1 Matrix Inversion 


If a is a nonzero real number, then а has a multiplicative inverse 1/a which has the 
property that (1/a)a = a(1/a) = 1. The existence of the multiplicative inverse of a 
nonzero real number then allows one to define an operation inverse to multiplication, 
division. In this section we take up this possibility for nonzero matrices. 

As we shall see, the process of matrix inversion is crucial to the development of much 
of the estimation theory for linear regression and our development is largely motivated 
by that fact. Although it is possible to define inverses of rectangular matrices our treat- 
ment of linear statistical models only requires that we investigate this process for square 
matrices. Even here, the subject is somewhat more complicated than for real numbers, 
since an arbitrary nonzero matrix need not have an inverse. At this point we will give
only the basic definitions and facts concerning inverses. Further theorems and some
computational aspects will be considered in Section 4.8.

For further reading on this aspect of matrix algebra and its relation to statistical
calculations we recommend that the reader consult Refs. [45, 103].


Definition 4.1 Let A be an $n \times n$ matrix. We say that the $n \times n$ matrix B is an
inverse for A if and only if

$$AB = BA = I_n. \tag{4.40}$$

Theorem 4.1 If A has an inverse, then the inverse is unique and will be written
as $A^{-1}$. Thus,

$$A^{-1}A = AA^{-1} = I_n. \tag{4.41}$$

Proof. Suppose that B and C are two inverses for A. Then since $AB = I_n$ and
$CA = I_n$, we have $B = I_nB = (CA)B = C(AB) = CI_n = C$, so that $B = C$ and the inverse is
unique. ∎


For practical purposes it is important to know which, if any, matrices have an inverse. 
Unfortunately, this is not always easily ascertained and may, in practical situations, 
require a great deal of computation. We first note that although the zero matrix cannot 
have an inverse, not every nonzero matrix has one either. To see this, let 


and suppose that 


But this is impossible since $1 \neq 0$.
For a $2 \times 2$ matrix

$$A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \tag{4.42}$$

one can easily show that $A^{-1}$ exists if and only if the determinant of A, $\det(A) =
a_{11}a_{22} - a_{12}a_{21} \neq 0$. In fact, the inverse of A is given by

$$A^{-1} = \frac{1}{\det(A)} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix}. \tag{4.43}$$

Equation (4.43) is a particular case of Cramer’s rule and it is well known that the
determinant criterion and formula may be extended to $n \times n$ matrices. Since we will have
little further use for such formulas the interested reader may consult Refs. [112, 103] for
more information.
For theoretical purposes a more convenient criterion for determining the invertibility 
of A is the following. 


Theorem 4.2 Let A be an $n \times n$ matrix. Then A has an inverse if and only if the
only $n$-vector x satisfying $Ax = 0$ is $x = 0$.


A proof of Theorem 4.2 may be found in [112, 103]. Its usefulness stems from the
fact that the equation $Ax = 0$, when written out in full, is a system of $n$ linear equations
in $n$ unknowns and standard algorithms, such as Gaussian elimination, may be used to
solve them. If the only solution to that system is 0 we may conclude that A has an
inverse.


Example 4.1 Using Theorem 4.2 show that

$$A = \begin{bmatrix} 1 & 1 & 1 \\ 1 & 0 & 2 \\ 0 & 1 & 1 \end{bmatrix}$$

has an inverse.
Let $x = (x, y, z)^T$; then we need only show that $Ax = 0$ implies $x = 0$. But,

$$Ax = \begin{bmatrix} x + y + z \\ x + 2z \\ y + z \end{bmatrix}$$

so that $Ax = 0$ gives

$$\begin{aligned} x + y + z &= 0 \\ x + 2z &= 0 \\ y + z &= 0 \end{aligned}$$

Now subtracting the second equation from the first gives $y - z = 0$ and using this and
$y + z = 0$ gives $y = z = 0$. Substituting this into $x + y + z = 0$ we get $x = 0$, showing that
$x = (0, 0, 0)^T = 0$. Since 0 is the only solution to $Ax = 0$, we conclude from Theorem
4.2 that A has an inverse. (Notice that we did not need to compute $A^{-1}$.)

Another convenient way of regarding Theorem 4.2 is to observe that

$$Ax = \sum_{i=1}^{n} x_i a_i \tag{4.44}$$

where $a_i$, $1 \le i \le n$, is the $i$-th column of A. Hence, the condition that $Ax = 0 \Rightarrow x = 0$
is the same as $\sum_{i=1}^{n} x_i a_i = 0 \Rightarrow x_i = 0$, $1 \le i \le n$. And this says the vectors $a_i$, $1 \le i \le n$,
are linearly independent. That is, the only linear combination of the column vectors of A
that can add up to zero must have all the coefficients $x_i$, $1 \le i \le n$, equal to zero. Thus,
if A is not invertible, there must exist at least one nonzero vector $x = (x_1, x_2, \ldots, x_n)^T$
such that

$$\sum_{i=1}^{n} x_i a_i = 0. \tag{4.45}$$

In this case we say that the column vectors are linearly dependent.

More generally, if A is an $n \times p$ matrix we define the rank of A to be the number of
linearly independent columns of A. So, if A is an $n \times n$ matrix, then it is invertible if
and only if its rank is $n$.

An important property of the rank is that $\mathrm{rank}(A) = \mathrm{rank}(A^T)$ so that the rank
of A is also the number of linearly independent rows. Hence, for an $n \times p$ matrix
$\mathrm{rank}(A) \le \min(n, p)$. A matrix having the property that $\mathrm{rank}(A) = \min(n, p)$ is said
to have full rank. Obviously, an invertible $n \times n$ matrix has rank $n$. These theoretical
properties, although perhaps a little abstract for now, will find considerable use in later
chapters.
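
The rank criterion can be checked mechanically by Gaussian elimination. The following sketch computes the rank in exact rational arithmetic and applies it to the matrix of Example 4.1; the function name `rank` and the two extra test matrices are hypothetical, and this is one simple way to do the elimination rather than a recommended numerical method.

```python
from fractions import Fraction

def rank(A):
    """Rank of A via Gaussian elimination, using exact fractions."""
    M = [[Fraction(v) for v in row] for row in A]
    rows, cols = len(M), len(M[0])
    r = 0  # number of pivots found so far
    for c in range(cols):
        pivot = next((i for i in range(r, rows) if M[i][c] != 0), None)
        if pivot is None:
            continue  # no pivot in this column
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, rows):
            f = M[i][c] / M[r][c]
            M[i] = [a - f * b for a, b in zip(M[i], M[r])]
        r += 1
        if r == rows:
            break
    return r

A = [[1, 1, 1], [1, 0, 2], [0, 1, 1]]   # matrix of Example 4.1
S = [[1, 2], [2, 4]]                    # second column is twice the first
R = [[1, 2, 3], [2, 4, 6]]              # 2 x 3 matrix with proportional rows

print(rank(A))                                   # 3, so A is invertible
print(rank(S))                                   # 1, so S is not invertible
print(rank(R), rank([list(c) for c in zip(*R)])) # rank(R) equals rank of its transpose
```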


Theorem 4.3 (Some further properties of $A^{-1}$)
(i) If A and B are invertible matrices, then AB is invertible and

$$(AB)^{-1} = B^{-1}A^{-1}. \tag{4.46}$$

(ii) If A is invertible, then $(A^{-1})^T = (A^T)^{-1}$.


(iii) If $A = \mathrm{diag}(a_i)$, $1 \le i \le n$, and $a_i \neq 0$, $1 \le i \le n$, then $A^{-1} = \mathrm{diag}(1/a_i)$, $1 \le
i \le n$. In particular, $I_n^{-1} = I_n$.
(iv) A is invertible if and only if $\det(A) \neq 0$.


Proof. (i) It suffices to show that $(B^{-1}A^{-1})AB = AB(B^{-1}A^{-1}) = I_n$. Now
using associativity,

$$(B^{-1}A^{-1})AB = B^{-1}(A^{-1}A)B = B^{-1}(I_n)B = B^{-1}B = I_n. \tag{4.47}$$

Similarly, $AB(B^{-1}A^{-1}) = I_n$.
(ii) Since $A^{-1}A = I_n$, taking the transpose of this gives $(A^{-1}A)^T = A^T(A^{-1})^T =
I_n$. Similarly, $(A^{-1})^T A^T = I_n$, which shows that $(A^{-1})^T$ is the inverse of $A^T$.


(iii) We leave the proof of (iii) as an exercise and (iv) is a standard theorem of linear
algebra which may be found in [112, 103]. ∎


4.4.2 The Inverse of Partitioned Matrices 


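It is often convenient to regard a matrix A in terms of various submatrices. For example,
if

$$A = \begin{bmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{bmatrix} \tag{4.48}$$

then we can consider it as composed of submatrices, delineated by bars, as follows;

$$\left[\begin{array}{cc|cc}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
\hline
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{array}\right] \tag{4.49}$$

and this can be further abbreviated as

$$\mathbf{A} = \left[\begin{array}{c|c} A & B \\ \hline C & D \end{array}\right] \tag{4.50}$$

where A, B, C, D are the submatrices indicated in (4.49). When $\mathbf{A}$ is written in the form
(4.50) we shall say it is in partitioned or block form.
If A and B are conformable matrices and each is partitioned as

$$A = \left[\begin{array}{c|c} A_{11} & A_{12} \\ \hline A_{21} & A_{22} \end{array}\right], \qquad
B = \left[\begin{array}{c|c} B_{11} & B_{12} \\ \hline B_{21} & B_{22} \end{array}\right] \tag{4.51}$$

then it can be shown that AB can be written in partitioned form as

$$AB = \left[\begin{array}{c|c}
A_{11}B_{11} + A_{12}B_{21} & A_{11}B_{12} + A_{12}B_{22} \\ \hline
A_{21}B_{11} + A_{22}B_{21} & A_{21}B_{12} + A_{22}B_{22}
\end{array}\right] \tag{4.52}$$

provided that the indicated products are well defined. Note that this product is formed
by the same rule as if the elements were scalars, keeping in mind that the correct order
must be used.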

Using (4.52) we can develop a number of formulas for the inversion of matrices in 
partitioned form. 
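Theorem 4.4 If

$$W = \left[\begin{array}{c|c} A & B \\ \hline C & D \end{array}\right] \tag{4.53}$$

and

$$P = A - BD^{-1}C, \qquad Q = D - CA^{-1}B \tag{4.54}$$

then

$$W^{-1} = \left[\begin{array}{c|c} P^{-1} & -A^{-1}BQ^{-1} \\ \hline -D^{-1}CP^{-1} & Q^{-1} \end{array}\right] \tag{4.55}$$

provided the requisite inverses exist.
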
Proof. Assuming W-! exists we write it in block form as 


wia | c p: | (4.56) 
апа {һеп 
WW =I, = | 2 Г | (4.57) 


where p +q = n. Multiplying (4.53) out gives 


: A;A - B1C | A,B-BiD 
1 = 1 1 1 1 
у м = |С GBIDD] ee) 


Equating matrices in (4.58) gives 


A,A+B,C=I,, AiB+B,D = 0, (4 59) 
C,A+D,C=0, СВ +р;.р =1,. | 
From (4.59) 
A, = (I, – В.С)А-! (4.60) 
and using this in the second equation of (4.59) gives 
(I—B,C)A1B+B,D=0. (4.61) 
Thus, 
В.р –- В.СА-!В+А-!В = В, (D -СА-!В) +A7'B=0. (4.62) 
Solving for Ву gives 
B; = -A^B(D-CA-^B) = -A7!BQ"!. (4.63) 
Using (4.59) again gives 
B,D = —AjB. (4.64) 
So 
В; = —A,BD". (4.65) 
Thus, 
A, A—A,BD7'C =I, (4.66) 
so that 
A, (A - BD^!C) =I, (4.67) 
and solving for A, gives 
A, =P". (4.68) 


Similarly we get С; and Оу. We leave the details as an exercise. Bi 
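
The block formula (4.55) can be checked numerically. The following is a minimal sketch in exact rational arithmetic, assuming 2 x 2 blocks; the particular blocks A, B, C, D and the helper names (`matmul`, `inv2`, `assemble`) are illustrative choices and not part of the text.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def sub(X, Y):
    """Entrywise difference X - Y."""
    return [[x - y for x, y in zip(r, s)] for r, s in zip(X, Y)]

def neg(X):
    """Entrywise negation."""
    return [[-x for x in r] for r in X]

def inv2(M):
    """Inverse of a 2 x 2 matrix by the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def assemble(TL, TR, BL, BR):
    """Stack four 2 x 2 blocks into one 4 x 4 matrix."""
    return [l + r for l, r in zip(TL, TR)] + [l + r for l, r in zip(BL, BR)]

# Hypothetical blocks, chosen so that every inverse in (4.54)-(4.55) exists.
A = [[F(4), F(1)], [F(2), F(3)]]
B = [[F(1), F(0)], [F(2), F(1)]]
C = [[F(0), F(1)], [F(1), F(1)]]
D = [[F(5), F(2)], [F(1), F(3)]]

P = sub(A, matmul(matmul(B, inv2(D)), C))   # P = A - B D^{-1} C
Q = sub(D, matmul(matmul(C, inv2(A)), B))   # Q = D - C A^{-1} B

W = assemble(A, B, C, D)
W_inv = assemble(
    inv2(P), neg(matmul(matmul(inv2(A), B), inv2(Q))),
    neg(matmul(matmul(inv2(D), C), inv2(P))), inv2(Q),
)

identity = [[F(int(i == j)) for j in range(4)] for i in range(4)]
check = matmul(W_inv, W)   # should equal the 4 x 4 identity exactly
```

Because the arithmetic is exact, `check` can be compared with `identity` directly rather than up to rounding error.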
4.4.3 The Sherman-Morrison-Woodbury Formula


In regression analysis one often needs to analyze the effect of adding or deleting an
observation. Doing this requires that we be able to calculate the inverse of a matrix
perturbed by the addition of a rank one matrix. A famous formula given by Sherman,
Morrison and Woodbury [8] (see also Rao (1973), p. 33) enables one to do this in terms
of $A^{-1}$. A proof of this is given next.


Theorem 4.5 (Sherman-Morrison-Woodbury Theorem) Let A be an n x n invertible
matrix and let z be an n x 1 column vector. If $z^T A^{-1} z \neq 1$ then the matrix $B = A - zz^T$
has an inverse and


\[ B^{-1} = A^{-1} + \frac{A^{-1}zz^TA^{-1}}{1 - z^TA^{-1}z}. \tag{4.69} \]


Before proving the theorem we establish another result in matrix multiplication which 
will be needed in the course of the proof. 
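Let u, v and y be n x 1 column vectors, then


\[ (u^T y)\,v = (v u^T)\,y. \tag{4.70} \]


To see this, we write $u = (u_1, u_2, \dots, u_n)^T$, $y = (y_1, y_2, \dots, y_n)^T$ and $v = (v_1, v_2, \dots, v_n)^T$.
Then,


\[ u^T y = (u_1, u_2, \dots, u_n) \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \sum_{i=1}^n u_i y_i \tag{4.71} \]

so that

\[ (u^T y)\,v = \left[ \Big(\sum_{i=1}^n u_i y_i\Big) v_1, \; \Big(\sum_{i=1}^n u_i y_i\Big) v_2, \; \dots, \; \Big(\sum_{i=1}^n u_i y_i\Big) v_n \right]^T. \tag{4.72} \]

Also,

\[ v u^T = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} (u_1, u_2, \dots, u_n) = \begin{bmatrix} v_1u_1 & v_1u_2 & \cdots & v_1u_n \\ v_2u_1 & v_2u_2 & \cdots & v_2u_n \\ \vdots & \vdots & & \vdots \\ v_nu_1 & v_nu_2 & \cdots & v_nu_n \end{bmatrix} \tag{4.73} \]

so that

\[ (v u^T)\,y = \begin{bmatrix} v_1u_1 & v_1u_2 & \cdots & v_1u_n \\ v_2u_1 & v_2u_2 & \cdots & v_2u_n \\ \vdots & \vdots & & \vdots \\ v_nu_1 & v_nu_2 & \cdots & v_nu_n \end{bmatrix} \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} = \begin{bmatrix} v_1 \sum_{i=1}^n u_i y_i \\ v_2 \sum_{i=1}^n u_i y_i \\ \vdots \\ v_n \sum_{i=1}^n u_i y_i \end{bmatrix} = (u^T y)\,v. \tag{4.74} \]


Proof of Theorem 4.5. We begin by observing that if the set of equations
Cx = y has a solution x = Dy for all n-vectors y, then $D = C^{-1}$. (This follows from
the uniqueness of the matrix inverse.) Hence, we consider solving


\[ Bx = y, \qquad y \in R^n. \tag{4.75} \]


Using the definition of B, and the above observation

\[ Bx = Ax - zz^Tx = Ax - cz \tag{4.76} \]
where $c = z^T x$ (which is a scalar). Thus,
\[ Ax - cz = y \tag{4.77} \]
so that
\[ x = cA^{-1}z + A^{-1}y. \tag{4.78} \]


Multiplying both sides of (4.78) by $z^T$ gives
\[ z^Tx = c = c\,z^TA^{-1}z + z^TA^{-1}y \tag{4.79} \]

and solving for c we get
\[ c = \frac{z^TA^{-1}y}{1 - z^TA^{-1}z}. \tag{4.80} \]


Substituting this value of c into (4.78) gives

\[ x = \frac{(z^TA^{-1}y)\,A^{-1}z}{1 - z^TA^{-1}z} + A^{-1}y. \tag{4.81} \]


Letting $u^T = z^TA^{-1}$ and $v = A^{-1}z$ in (4.70), then

\[ (z^TA^{-1}y)\,A^{-1}z = (A^{-1}zz^TA^{-1})\,y. \tag{4.82} \]
Thus,
\[ x = Cy \tag{4.83} \]
where
\[ C = \frac{A^{-1}zz^TA^{-1}}{1 - z^TA^{-1}z} + A^{-1}. \tag{4.84} \]


From our remark at the beginning of the proof $C = B^{-1}$. ∎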
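
As an illustration of Theorem 4.5, the sketch below compares the right-hand side of (4.84) with a directly computed inverse of $B = A - zz^T$. It assumes a 2 x 2 example in exact rational arithmetic; the matrix A, the vector z and the helper `inv2` are hypothetical choices made only for this check.

```python
from fractions import Fraction as F

def inv2(M):
    """Inverse of a 2 x 2 matrix by the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

# Hypothetical data: an invertible A and a column vector z.
A = [[F(4), F(1)], [F(1), F(3)]]
z = [F(1), F(2)]

A_inv = inv2(A)
Az = [sum(A_inv[i][j] * z[j] for j in range(2)) for i in range(2)]   # A^{-1} z
zA = [sum(z[i] * A_inv[i][j] for i in range(2)) for j in range(2)]   # z^T A^{-1}
s = sum(z[i] * Az[i] for i in range(2))                              # z^T A^{-1} z

# The theorem requires z^T A^{-1} z != 1.
assert s != 1

B = [[A[i][j] - z[i] * z[j] for j in range(2)] for i in range(2)]    # B = A - z z^T

# Rank-one update of the inverse, as in (4.84).
B_inv_formula = [[A_inv[i][j] + Az[i] * zA[j] / (1 - s) for j in range(2)]
                 for i in range(2)]

# Direct inversion for comparison.
B_inv_direct = inv2(B)
```

The practical appeal of the formula is that, once $A^{-1}$ is available, the update costs only matrix-vector products rather than a fresh inversion.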


4.5 The Geometry of Vectors


It will be convenient to interpret some aspects of matrix and vector algebra in geometric
terms. Just as a 2-vector $(x, y)^T$ may be interpreted as a point in the plane or as a line
segment (geometric vector) joining the origin $(0, 0)^T$ to the point $(x, y)^T$ with a direction
pointing from $(0, 0)^T$ to $(x, y)^T$, an n-vector $(x_1, x_2, \dots, x_n)^T$ may be interpreted as a point
in Euclidean n-space $R^n$ or as a geometric vector pointing from the origin $(0, 0, \dots, 0)^T$
to $(x_1, x_2, \dots, x_n)^T$. With this geometric interpretation of vectors we can introduce the
important notions of length and angle. Basic to doing this is the notion of the inner
product of two vectors.


Definition 4.2 Let $x = (x_1, x_2, \dots, x_n)^T$ and $y = (y_1, y_2, \dots, y_n)^T$ be two n-vectors.
The inner product $\langle x, y\rangle$ (sometimes also called the dot product) of x and y is the real
number defined by


\[ \langle x, y\rangle = \sum_{i=1}^n x_i y_i. \tag{4.85} \]


Note: If y is regarded as a row, rather than as a column vector, then $\langle x, y\rangle$ may be
written in matrix multiplication notation as


\[ \langle x, y\rangle = y^Tx = (y_1, y_2, \dots, y_n) \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} \tag{4.86} \]


which is preferred by many authors.
The inner product has а number of simple properties which we quote below. Because 
of their simplicity, we leave most of the proofs of these to the reader. 


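Theorem 4.6 (Properties of the inner product)
(i) $\langle x, y\rangle = \langle y, x\rangle$ (symmetry);
(ii) $\langle x + y, z\rangle = \langle x, z\rangle + \langle y, z\rangle$;
(iii) $\langle x, x\rangle \ge 0$ and $\langle x, x\rangle = 0$ if and only if x = 0.
(iv) If c is a real number, then
\[ \langle cx, y\rangle = c\,\langle x, y\rangle = \langle x, cy\rangle. \tag{4.87} \]
(v) If A is an m x n matrix, then


\[ \langle x, Ay\rangle = \langle A^Tx, y\rangle. \tag{4.88} \]


(vi) (The Cauchy-Schwarz inequality)
\[ |\langle x, y\rangle| \le \sqrt{\langle x, x\rangle}\,\sqrt{\langle y, y\rangle}. \tag{4.89} \]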


Proof. (i)-(v) require only straightforward algebraic manipulations and are left as
exercises.

For (vi) we consider the quantity $Q(\lambda) = \langle x + \lambda y, x + \lambda y\rangle$. Expanding $Q(\lambda)$ using
(i)-(iv) of the theorem gives


\[ Q(\lambda) = \langle x, x\rangle + 2\lambda\,\langle x, y\rangle + \lambda^2\,\langle y, y\rangle \tag{4.90} \]
and from (4.90) we see that $Q(\lambda)$ is a quadratic function of $\lambda$ and $Q(\lambda) \ge 0$. We consider
two cases:
(a) if y = 0, then $\langle x, y\rangle = 0$ and $\langle y, y\rangle = 0$ so that (4.89) is true;
(b) if $y \neq 0$, then $Q(\lambda)$ is a true quadratic and from elementary algebra the only
way that $Q(\lambda) \ge 0$ for all $\lambda$ is that $Q(\lambda)$ either has no real roots or two equal real roots. In these
cases the discriminant satisfies
\[ 4\,\langle x, y\rangle^2 - 4\,\langle y, y\rangle\,\langle x, x\rangle \le 0. \tag{4.91} \]


Transposing and taking square roots gives
\[ |\langle x, y\rangle| \le \sqrt{\langle x, x\rangle}\,\sqrt{\langle y, y\rangle}. \]


4.5.1 Length and Angle 


From elementary geometry (Pythagorean theorem) it is well known that the length of
the vector in two dimensions $x = (x_1, x_2)^T$ is given by


\[ \sqrt{x_1^2 + x_2^2} = \sqrt{\langle x, x\rangle}. \tag{4.92} \]
Generalizing this for a vector with n components we define the length of $x = (x_1, x_2, \dots, x_n)^T$
by


\[ \|x\| = \sqrt{\langle x, x\rangle} = \Big(\sum_{i=1}^n x_i^2\Big)^{1/2}. \tag{4.93} \]


Using Theorem 4.6, $\|x\|$ can be shown to have the expected properties of length.
Among these are:
(i) $\|x\| \ge 0$ and $\|x\| = 0$ if and only if x = 0;
(ii) $\|cx\| = |c|\,\|x\|$;
(iii) (Triangle inequality) $\|x + y\| \le \|x\| + \|y\|$.


Proof. Properties (i) and (ii) being elementary, we establish (iii). For this we
consider


\[ \|x + y\|^2 = \langle x + y, x + y\rangle = \langle x, x\rangle + 2\,\langle x, y\rangle + \langle y, y\rangle = \|x\|^2 + \|y\|^2 + 2\,\langle x, y\rangle. \tag{4.94} \]


By the Cauchy-Schwarz inequality $|\langle x, y\rangle| \le \|x\|\,\|y\|$ so that
\[ \|x + y\|^2 \le \|x\|^2 + \|y\|^2 + 2\,\|x\|\,\|y\| = (\|x\| + \|y\|)^2. \tag{4.95} \]
Taking square roots in (4.95) gives
\[ \|x + y\| \le \|x\| + \|y\| \tag{4.96} \]


as required. ∎
Along with length, the inner product allows us to introduce the concept of the angle
between two vectors. For this, assume that $x \neq 0$, $y \neq 0$ and consider the ratio


\[ r = \frac{\langle x, y\rangle}{\|x\|\,\|y\|}. \tag{4.97} \]


From the Cauchy-Schwarz inequality $-1 \le r \le 1$, so from trigonometry there is an angle
$0 \le \theta \le \pi$ such that


\[ \cos\theta = \frac{\langle x, y\rangle}{\|x\|\,\|y\|}. \tag{4.98} \]

Since the cosine is one-to-one on $[0, \pi]$, we have


\[ \theta = \cos^{-1}\left(\frac{\langle x, y\rangle}{\|x\|\,\|y\|}\right). \tag{4.99} \]


We call $\theta$ the angle between the vectors x and y. If either x or y is zero, we leave the
notion of angle undefined. If $\langle x, y\rangle = 0$, then $\theta = \cos^{-1}(0) = \pi/2$ and we then say that
x and y are orthogonal vectors. Orthogonality generalizes the notion of perpendicularity
of vectors in the plane and three-space.

As we shall see, orthogonality is a fundamental concept in much of regression analysis 
and much of its importance stems from the following geometric fact. Suppose that we 
have a line l in the plane that passes through the origin. Such a line can be given in
geometric terms as the set of vectors $\{tv\}$, $t \in R$, where v is a vector of unit length
lying in l. Now consider a point (p, q) which does not lie on l and consider the point on
l which is closest to (p, q). From Figure 4.2 it is clear that this point, call it $t_1v$, has the
property that the vector $t_1v$ must be perpendicular to the line passing through the point
$y = (p, q)^T$ and $t_1v$. Thus the vectors $y - t_1v$ and $t_1v$ must be orthogonal,


\[ \langle y - t_1v, t_1v\rangle = 0 \tag{4.100} \]
and solving for $t_1$ using $\langle v, v\rangle = 1$ gives
\[ t_1 = \langle y, v\rangle. \tag{4.101} \]
The vector
\[ y_p = \langle y, v\rangle\,v \tag{4.102} \]
is called the orthogonal projection of y onto the line l. Suitably generalized this
construction is the geometrical basis for the method of least squares which we shall discuss
in greater detail in Chapter 5.


Figure 4.2: The Orthogonal Projection of y on l

In that case we will consider the set of n-vectors $P = \{\sum_{j=1}^p t_j v_j\}$, called a hyperplane,
where $\{v_j\}_{j=1}^p$ are linearly independent and $p \le n$. Again we suppose that $y \in R^n$ and
we want to find the point in the hyperplane P closest to y. Generalizing the geometry of
the two dimensional case, this will occur when the vector $y - \sum_{j=1}^p t_j v_j$ is orthogonal to
P, and this is equivalent to having $y - \sum_{j=1}^p t_j v_j$ perpendicular to each $v_i$, i = 1, 2, ..., p.
Thus, the coordinates $\{t_j\}$ are found by solving the linear equations


\[ \sum_{j=1}^p \langle v_i, v_j\rangle\,t_j = \langle y, v_i\rangle, \qquad 1 \le i \le p. \tag{4.103} \]


As we shall see, these are the generalizations of the least squares equations used for 
estimating the parameters in the simple regression model. 
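
To make (4.103) concrete, here is a small sketch that projects a vector in $R^3$ onto the plane spanned by two vectors by solving the 2 x 2 system of inner products. The vectors `v1`, `v2`, `y` are hypothetical, and Cramer's rule is used only because the system is 2 x 2; it is an illustrative choice, not a method prescribed by the text.

```python
from fractions import Fraction as F

def inner(x, y):
    """Inner product (4.85) of two vectors given as lists."""
    return sum(a * b for a, b in zip(x, y))

# Hypothetical spanning vectors and a point to be projected.
v1 = [F(1), F(0), F(1)]
v2 = [F(0), F(1), F(1)]
y = [F(1), F(2), F(6)]

# Coefficients and right-hand side of the linear equations (4.103).
g11, g12, g22 = inner(v1, v1), inner(v1, v2), inner(v2, v2)
r1, r2 = inner(y, v1), inner(y, v2)

# Solve the 2 x 2 system by Cramer's rule.
det = g11 * g22 - g12 * g12
t1 = (r1 * g22 - g12 * r2) / det
t2 = (g11 * r2 - g12 * r1) / det

# Projection of y onto the plane, and the residual y - y_p.
y_p = [t1 * a + t2 * b for a, b in zip(v1, v2)]
residual = [a - b for a, b in zip(y, y_p)]
```

The residual is orthogonal to both spanning vectors, which is exactly the condition used to derive (4.103).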


4.5.2 Subspaces and Bases 


As with vectors, it is often convenient to discuss matrices in geometric terms. As we
have already observed, we can view a matrix as composed of columns $a_i$, $1 \le i \le p$. If
we consider the set of linear combinations of $a_i$, $1 \le i \le p$,


\[ S = \Big\{ \sum_{j=1}^p c_j a_j : \; c_j \in R, \; 1 \le j \le p \Big\} \tag{4.104} \]


then these determine a hyperplane in $R^n$ called the column space spanned by $a_i$, $1 \le i \le p$.
More generally, if we have any sequence of vectors $x_i$, $1 \le i \le p$, in $R^n$ the set of all linear
combinations of $x_i$, $1 \le i \le p$, is called a subspace of $R^n$. One can easily show that if S
is a subspace, then it is closed under the operations of scalar multiplication and addition,
i.e., if $x \in S$ and $c \in R$ then $cx \in S$, and if $x \in S$ and $y \in S$, then $x + y \in S$. Conversely,
it can be shown that if S is any subset of $R^n$ closed under scalar multiplication and
addition, then there is a set of vectors $\{x_i\}_{i=1}^p$ such that S is spanned by the $x_i$'s, i.e., it
can be written in the form (4.104). Since the set $\{x_i\}_{i=1}^p$ can contain redundant vectors,
we consider the smallest subset of the $x_i$'s which spans S. It can be shown that such a set
of vectors is linearly independent. Such a set of linearly independent spanning vectors
is called a basis for S. An important theorem in linear algebra states that all bases for
a subspace S contain the same number of vectors. This number is referred to as the
dimension of S, denoted by dim(S). For a matrix A the dimension of the column space
is the rank of A.

In general, if $S_1$ and $S_2$ are two subspaces of $R^n$ and $S_1 \subset S_2$, then $\dim(S_1) \le \dim(S_2)$.
In particular, $\dim(R^n) = n$ since $R^n$ is spanned by the canonical basis


\[ e_i = (0, \dots, 0, 1, 0, \dots, 0)^T, \qquad 1 \le i \le n, \tag{4.105} \]

where the 1 is in the i-th position.


The canonical basis $\{e_i\}_{i=1}^n$ of $R^n$ has an important additional property: it is an orthogonal
basis; i.e.,


\[ \langle e_i, e_j\rangle = 0, \qquad i \neq j. \tag{4.106} \]


Orthogonal bases play an important role in regression analysis and it is important to
know if a given subspace has an orthogonal basis. Fortunately, the answer is 'yes'.
Moreover, this property can be established constructively by a process called Gram-Schmidt
orthogonalization, which is important both theoretically and computationally.


Theorem 4.7 Let S be a subspace of $R^n$. If $\dim(S) = p \le n$, then S has an
orthogonal basis.


Proof. A full proof of the theorem can be found in [112]. Here we merely outline
the general construction, called, as noted above, Gram-Schmidt orthogonalization.

Since S is a subspace it has a basis $v_i$, $1 \le i \le p$. If the $v_i$'s are orthogonal, then we
are done. If not, we construct an orthogonal set $w_i$, $1 \le i \le p$, from the $v_i$'s which
spans S. Since nonzero orthogonal vectors are linearly independent, $w_i$, $1 \le i \le p$, are a basis for
S.

The basic idea of the proof is to start with $v_1$, and normalize it by dividing by $\|v_1\|$,
giving $w_1 = v_1/\|v_1\|$ (notice that $\|w_1\| = 1$). Then at each stage, having obtained
$w_1, w_2, \dots, w_l$, we compute the orthogonal projection, $\mathrm{Proj}\,v_{l+1}$, of $v_{l+1}$ onto the span of
$\{w_j\}_{j=1}^l$ and then form


\[ w'_{l+1} = v_{l+1} - \mathrm{Proj}\,v_{l+1}. \tag{4.107} \]
Finally,
\[ w_{l+1} = \frac{w'_{l+1}}{\|w'_{l+1}\|}. \tag{4.108} \]


We demonstrate this explicitly for vectors $v_1, v_2, v_3$. The projection of $v_2$ on $w_1$ is
given by $\langle v_2, w_1\rangle w_1$, then


\[ w'_2 = v_2 - \langle v_2, w_1\rangle\,w_1 \tag{4.109} \]
so that
\[ \langle w'_2, w_1\rangle = \langle w_1, v_2\rangle - \langle v_2, w_1\rangle\,\langle w_1, w_1\rangle = \langle w_1, v_2\rangle - \langle v_2, w_1\rangle = 0. \tag{4.110} \]


Hence, $w_1$ and $w'_2$ are orthogonal. Then, $w_2 = w'_2/\|w'_2\|$. Following (4.109) we define
$w'_3$ by


\[ w'_3 = v_3 - \langle v_3, w_1\rangle\,w_1 - \langle v_3, w_2\rangle\,w_2. \tag{4.111} \]
From (4.110)
\[ \langle w'_3, w_1\rangle = \langle v_3, w_1\rangle - \langle v_3, w_1\rangle\,\langle w_1, w_1\rangle - \langle v_3, w_2\rangle\,\langle w_2, w_1\rangle = \langle v_3, w_1\rangle - \langle v_3, w_1\rangle = 0 \tag{4.112} \]


since $w_1$ and $w_2$ are orthogonal. Similarly, $\langle w'_3, w_2\rangle = 0$. Then, $w_1$, $w_2$, $w_3 = w'_3/\|w'_3\|$
are an orthogonal basis for $\mathrm{span}\{v_1, v_2, v_3\}$.
In general, it follows from (4.112) that $w'_{l+1}$ in (4.108) is given explicitly by


\[ w'_{l+1} = v_{l+1} - \sum_{j=1}^{l} \langle v_{l+1}, w_j\rangle\,w_j. \tag{4.113} \]


We leave it as an exercise to show that $w_j$, $1 \le j \le p$, are orthogonal. ∎
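
The construction in (4.107), (4.108) and (4.113) translates almost line by line into code. The sketch below is one possible floating-point rendering of the classical procedure; the function name `gram_schmidt`, the tolerance `1e-12` and the input vectors are assumptions made for illustration.

```python
import math

def inner(x, y):
    """Inner product (4.85) of two vectors given as lists."""
    return sum(a * b for a, b in zip(x, y))

def gram_schmidt(vectors):
    """Orthonormalize linearly independent vectors as in (4.107)-(4.108)."""
    basis = []
    for v in vectors:
        # w' = v - sum_j <v, w_j> w_j, which is (4.113)
        w_prime = list(v)
        for w in basis:
            c = inner(v, w)
            w_prime = [a - c * b for a, b in zip(w_prime, w)]
        norm = math.sqrt(inner(w_prime, w_prime))
        if norm < 1e-12:
            # A zero remainder means v already lies in the span of the basis.
            raise ValueError("input vectors are linearly dependent")
        # Normalization step (4.108)
        basis.append([a / norm for a in w_prime])
    return basis

# Hypothetical basis of R^3 that is not orthogonal.
W = gram_schmidt([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
```

In floating-point work the variant that subtracts each projection from the running remainder (often called modified Gram-Schmidt) is usually preferred for numerical stability; the version above follows the formulas of the text instead.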
4.6 Orthogonal Matrices


We note that the vectors $w_i$, $1 \le i \le p$, in Theorem 4.7 are not only orthogonal, but
have the additional property that $\|w_i\| = 1$, $1 \le i \le p$. Any set of vectors $u_i$, $1 \le i \le p$,
in $R^n$ satisfying


\[ \langle u_i, u_j\rangle = \begin{cases} 0, & i \neq j \\ 1, & i = j \end{cases} \tag{4.114} \]


is said to be orthonormal.
If p = n, then $\{u_i\}_{i=1}^n$ is an orthonormal basis for $R^n$. If $u_i$, $1 \le i \le n$, is an
orthonormal basis for $R^n$ then the matrix


\[ U = [u_1 | u_2 | \cdots | u_n] \tag{4.115} \]


whose i-th column is $u_i$ is called an orthogonal matrix. Orthogonal matrices will play a
significant role in this and later chapters.


Theorem 4.8 (Properties of orthogonal matrices)
(i) A matrix U is orthogonal if and only if $U^{-1} = U^T$.
(ii) If U is orthogonal, then $U^T$ is orthogonal.
(iii) If $U_1$ and $U_2$ are orthogonal, then $U_1U_2$ is orthogonal.
(iv) If U is an n x n orthogonal matrix, then for all $x \in R^n$


\[ \langle Ux, Ux\rangle = \langle x, x\rangle. \tag{4.116} \]


(Thus, the length of Ux is the same as that of x. Hence, orthogonal matrices represent
rotations, possibly combined with reflections, in $R^n$.)


(v) If U is orthogonal, then
\[ \det(U) = \pm 1. \tag{4.117} \]


Proof. (i) Consider the product $U^TU$. By the properties of matrix multiplication
it is easily shown that the ij-th element of $U^TU$ is $\langle u_i, u_j\rangle$, because the i-th row of
$U^T$ is the i-th column of U. By assumption (4.114) the $\{u_i\}_{i=1}^n$ are orthonormal, so that
$U^TU = I_n$. Since U is square, $UU^T = I_n$ as well, which shows that $U^T = U^{-1}$.

On the other hand, if $U^{-1} = U^T$, then $UU^T = U^TU = I_n$ and (4.114) holds. Thus,
an orthogonal matrix may be characterized by the property that $U^{-1} = U^T$.

(ii) Since U is orthogonal, $U^T = U^{-1}$ and taking the transpose gives $(U^T)^T = (U^{-1})^T$.
But $(U^{-1})^T = (U^T)^{-1}$, so that $(U^T)^T = (U^T)^{-1}$, which shows that $U^T$ is orthogonal.

(iii) It suffices to show that $(U_1U_2)^T = (U_1U_2)^{-1}$. But


\[ (U_1U_2)^T = U_2^TU_1^T = U_2^{-1}U_1^{-1} = (U_1U_2)^{-1} \]


by (4.21) and (i). Thus, $U_1U_2$ is orthogonal if $U_1$ and $U_2$ are orthogonal.
(iv) For this we observe from (4.88) that


\[ \langle Ux, Ux\rangle = \langle x, U^TUx\rangle = \langle x, x\rangle \]
since $U^TU = I_n$. (We note in passing that the converse of this result is also true. That
is, if $\langle Ux, Ux\rangle = \langle x, x\rangle$ for all $x \in R^n$, then U is orthogonal.)
(v) By a well known theorem on determinants $\det(AB) = \det(A)\det(B)$, so that


\[ \det(UU^T) = \det(I_n) = \det(U)\det(U^T) = \det{}^2(U) \tag{4.118} \]


since $\det(A) = \det(A^T)$ for all n x n matrices. But $\det(I_n) = 1$, so that $\det^2(U) = 1$
and so $\det(U) = \pm 1$. ∎


4.6.1 Eigenvectors and Eigenvalues 


Since an arbitrary matrix can be a quite complicated mathematical object, it is often 
quite helpful, for both theory and computation, to be able to decompose a given matrix 
into simpler components, somewhat like factoring an integer into prime numbers. In 
matrix algebra many such decompositions exist, depending on the matrix involved. 

In regression analysis decompositions of symmetric matrices are crucial to much of 
the theory. For us, the most important decomposition is a result of the so called
diagonalization or spectral theorem which enables one to write a symmetric matrix A as
a product of two orthogonal and one diagonal matrix. The columns of the orthogonal 
component are composed of an important set of vectors related to A, its eigenvectors. 

We now briefly take up this important topic. 


Definition 4.3 Let A be an n x n matrix. We say that a nonzero vector $x \in R^n$ is
an eigenvector of A if there exists a real number $\lambda \in R$ such that


\[ Ax = \lambda x. \tag{4.119} \]


The number $\lambda$ is called an eigenvalue of A corresponding to x.


From a theoretical point of view, one of the problems associated with Definition 4.3
is that not every matrix A need possess an eigenvector. Geometrically, this may be seen
if we think of an eigenvector as defining a line which gets mapped into itself under A. If,
for example, A is a matrix corresponding to a rotation, then no line remains fixed and
so A can have no eigenvector.

For example, if


\[ A = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix} \]
which represents a clockwise rotation of 90°, then for $\mathbf{x} = (x, y)^T$ we have $A\mathbf{x} = (y, -x)^T$ so that
$A^2\mathbf{x} = (-x, -y)^T = -\mathbf{x}$. Thus, if $\mathbf{x}$ is an eigenvector with eigenvalue $\lambda$, then $A\mathbf{x} = \lambda\mathbf{x}$ so
that $A^2\mathbf{x} = \lambda A\mathbf{x} = \lambda^2\mathbf{x} = -\mathbf{x}$, which gives $(\lambda^2 + 1)\mathbf{x} = 0$. If $\mathbf{x} \neq 0$, then $\lambda^2 + 1 = 0$
and there is no real value of $\lambda$ satisfying this equation.

In order to alleviate some of the difficulties associated with this problem it is more
convenient to permit both x and $\lambda$ to be complex. In this case, as is shown next, every
matrix has at least one eigenvector-eigenvalue pair.


Theorem 4.9 Let A be an n x n real matrix. Then, there exists at least one complex
number $\lambda$ and nonzero complex vector x such that $Ax = \lambda x$.
Proof. For x to be an eigenvector of A we must have $(A - \lambda I_n)x = 0$ with $x \neq 0$. Thus
from Theorem 4.5 the matrix $A - \lambda I_n$ cannot have an inverse. From the theory of
determinants, this can happen if and only if $c(\lambda) = \det(A - \lambda I_n) = 0$. But as is shown
in [112], $\det(A - \lambda I_n)$ is a polynomial of exact degree n and from the fundamental
theorem of algebra [112] this polynomial (called the characteristic polynomial of A) must
have at least one complex root $\lambda$. This root and a nonzero vector x such that $(A - \lambda I_n)x = 0$ is
an eigenvector-eigenvalue pair for A. ∎


Since Theorem 4.9 shows that A has at least one eigenvalue and these eigenvalues
are the roots of $c(\lambda) = 0$, an n x n matrix can have at most n distinct eigenvalues.
However, it is possible for A to have fewer than n distinct eigenvalues. Moreover, since
any nonzero scalar multiple of an eigenvector is also an eigenvector, the question of classifying the
nature of the eigenvectors is even more complex and will not be dealt with in generality
here. However, if A is symmetric, then things simplify considerably, and fortunately for
regression analysis, this is the case of most importance. We take this up next.
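
For a 2 x 2 matrix the characteristic polynomial is $c(\lambda) = \lambda^2 - (a + d)\lambda + (ad - bc)$, so its roots can be written down with the quadratic formula. The sketch below, with the hypothetical helper `eigenvalues_2x2`, contrasts the rotation matrix of the example above (no real eigenvalues) with a symmetric matrix (real eigenvalues).

```python
import cmath

def eigenvalues_2x2(a, b, c, d):
    """Roots of det(A - lambda I) = 0 for A = [[a, b], [c, d]]."""
    trace = a + d
    det = a * d - b * c
    # cmath.sqrt accepts a negative discriminant and returns a complex root.
    root = cmath.sqrt(trace * trace - 4 * det)
    return (trace + root) / 2, (trace - root) / 2

# The 90-degree rotation from the example: lambda^2 + 1 = 0.
rotation = eigenvalues_2x2(0, 1, -1, 0)

# A symmetric matrix (illustrative values): the roots are real.
symmetric = eigenvalues_2x2(2, 1, 1, 2)
```

The rotation yields the conjugate pair $\pm i$, in line with Theorem 4.9, while the symmetric matrix yields the real values 3 and 1.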


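4.6.2 The Spectral Theorem for Symmetric Matrices


Theorem 4.10 If A is an n x n symmetric matrix, then
(i) all the eigenvalues of A are real;
(ii) eigenvectors corresponding to distinct eigenvalues are orthogonal.


Before proving Theorem 4.10 we need to extend the notion of an inner product to
complex n-vectors. If $x = (x_1, x_2, \dots, x_n)^T$ is a vector with each component a complex
number, then $\bar{x} = (\bar{x}_1, \bar{x}_2, \dots, \bar{x}_n)^T$ is called the complex conjugate of x, where $\bar{x}_i$ is the
complex conjugate of $x_i$. The inner product of two complex vectors x, y is then given
by


\[ \langle x, y\rangle = \sum_{i=1}^n x_i \bar{y}_i \tag{4.120} \]


which reduces to (4.85) if both x and y are real. The properties of the complex inner
product are similar to those of Theorem 4.6 and are listed below.


(i) $\langle x, y + z\rangle = \langle x, y\rangle + \langle x, z\rangle$;

(ii) $\langle x, y\rangle = \overline{\langle y, x\rangle}$;

(iii) $\langle x, cy\rangle = \bar{c}\,\langle x, y\rangle$, where c is a complex number, and $\langle cx, y\rangle = c\,\langle x, y\rangle$;

(iv) $|\langle x, y\rangle| \le \|x\|\,\|y\|$, where $\|x\| = (\sum_{i=1}^n x_i\bar{x}_i)^{1/2}$ and $\|y\| = (\sum_{i=1}^n y_i\bar{y}_i)^{1/2}$.


Proof of Theorem 4.10. (i) Since $\lambda$ is an eigenvalue (possibly complex) of A,
then $Ax = \lambda x$. Thus, $\langle x, Ax\rangle = \langle x, \lambda x\rangle$. From property (iii) of the complex inner
product $\langle x, \lambda x\rangle = \bar{\lambda}\,\langle x, x\rangle$ and, using the fact that A is real and symmetric,

\[ \langle x, Ax\rangle = \langle A^Tx, x\rangle = \langle Ax, x\rangle = \langle \lambda x, x\rangle = \lambda\,\langle x, x\rangle. \tag{4.121} \]


Thus, $\bar{\lambda}\,\langle x, x\rangle = \lambda\,\langle x, x\rangle$ and since $x \neq 0$, $\langle x, x\rangle \neq 0$, which gives $\bar{\lambda} = \lambda$. But a complex
number equals its complex conjugate if and only if it is real, so $\lambda$ is real. (We note from
this, that since $Ax = \lambda x$, x may be chosen to be real as well.)
(ii) Suppose that $Ax_1 = \lambda_1x_1$ and $Ax_2 = \lambda_2x_2$ where $\lambda_1 \neq \lambda_2$. Then, $\langle Ax_1, x_2\rangle =
\langle \lambda_1x_1, x_2\rangle$ and $\langle Ax_2, x_1\rangle = \langle \lambda_2x_2, x_1\rangle$. Now, since $x_1$ and $x_2$ are real,


\[ \langle Ax_1, x_2\rangle = \langle x_1, A^Tx_2\rangle = \langle x_1, Ax_2\rangle = \langle Ax_2, x_1\rangle. \tag{4.122} \]
Thus, subtracting gives
\[ \langle \lambda_1x_1, x_2\rangle - \langle \lambda_2x_2, x_1\rangle = \lambda_1\,\langle x_1, x_2\rangle - \lambda_2\,\langle x_2, x_1\rangle = (\lambda_1 - \lambda_2)\,\langle x_1, x_2\rangle = 0. \]


Since $\lambda_1 \neq \lambda_2$, $\langle x_1, x_2\rangle = 0$, showing that $x_1$ and $x_2$ are orthogonal. ∎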


As we pointed out above, an n x n matrix A can have anywhere from one to n
linearly independent eigenvectors. We will now examine what kinds of matrices can have
n linearly independent eigenvectors.

Suppose this is true for A, then there exist n linearly independent eigenvectors
$x_1, x_2, \dots, x_n$, corresponding to the eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_n$ which are not necessarily
distinct. Thus, $Ax_i = \lambda_ix_i$, $1 \le i \le n$. Let U denote the matrix whose columns are
$x_1, x_2, \dots, x_n$ arranged in that order. Thus, $U = [x_1 | x_2 | \cdots | x_n]$ and
$AU = [Ax_1 | Ax_2 | \cdots | Ax_n] = [\lambda_1x_1 | \lambda_2x_2 | \cdots | \lambda_nx_n]$. Now it is easily shown that
$[\lambda_1x_1 | \lambda_2x_2 | \cdots | \lambda_nx_n] = U\Lambda$, where $\Lambda = \mathrm{diag}(\lambda_i)$, $1 \le i \le n$. Thus,


\[ AU = U\Lambda \tag{4.123} \]
and since U is invertible (it has rank n)
\[ A = U\Lambda U^{-1}. \tag{4.124} \]


Thus, A can be decomposed as the product of a diagonal matrix, an invertible matrix
and its inverse. In the language of linear algebra [112] it is customary to say that A is
similar to the diagonal matrix $\Lambda$. On the other hand, if (4.124) is true, then A must
have n linearly independent eigenvectors. Although it is known from linear algebra which
matrices have property (4.124), the only situation that will concern us is that when A is
symmetric. That symmetric matrices are similar to diagonal matrices is one of the most
important facts of matrix algebra and, as we will show, this result, the spectral theorem,
has numerous applications in regression analysis.


Theorem 4.11 (Spectral theorem) Let A be an n x n symmetric matrix, then A
is similar to a diagonal matrix. Moreover, as a consequence of the orthogonality of the
eigenvectors of A, U in (4.124) can be chosen to be orthogonal so that


\[ A = U\Lambda U^T, \tag{4.125} \]


where the columns of U are orthonormal eigenvectors of A and $\Lambda$ is the diagonal matrix
of eigenvalues of A.


Proof. We will show, using an inductive argument, that there exists an orthogonal
matrix U such that $U^TAU = \Lambda$. From this (4.125) follows and by using the argument
immediately following Theorem 4.10 it follows that the columns of U are eigenvectors of
A and the diagonal elements of $\Lambda$ the corresponding eigenvalues.

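If A is a 1 x 1 matrix [a], then the theorem is trivially true by taking $x_1 = (1)$.
Suppose then that the theorem is true for all (n - 1) x (n - 1) symmetric matrices. Now
if A is n x n, then it has at least one real eigenvalue $\lambda_1$ and corresponding eigenvector
$x_1$, such that $\|x_1\| = 1$.
By the Gram-Schmidt process there exist vectors $u_2, u_3, \dots, u_n$ such that $x_1, u_2, \dots, u_n$
is an orthonormal basis for $R^n$. Let $U_1 = [x_1 | u_2 | \cdots | u_n]$ and observe that $U_1^TU_1 = I_n$
since $U_1$ is orthogonal.
Now let
\[ B_1 = U_1^TAU_1, \tag{4.126} \]


then $B_1$ has the form


\[ B_1 = \begin{bmatrix} \lambda_1 & 0 \\ 0 & A_2 \end{bmatrix} \tag{4.127} \]
where $A_2$ is an (n - 1) x (n - 1) symmetric matrix.
To see this, consider
\[ AU_1 = [Ax_1 | Au_2 | \cdots | Au_n] = [\lambda_1x_1 | v_2 | \cdots | v_n] \tag{4.128} \]
where $v_i = Au_i$. Thus,
\[ B_1 = U_1^TAU_1 = [\lambda_1U_1^Tx_1 | U_1^Tv_2 | \cdots | U_1^Tv_n]. \tag{4.129} \]
Since $x_1$ is the first column of $U_1$ and $U_1$ is orthogonal,
\[ U_1^Tx_1 = \begin{bmatrix} x_1^T \\ u_2^T \\ \vdots \\ u_n^T \end{bmatrix} x_1 = [\langle x_1, x_1\rangle, \langle u_2, x_1\rangle, \dots, \langle u_n, x_1\rangle]^T = (1, 0, \dots, 0)^T \tag{4.130} \]
so that $B_1$ can be partitioned as
\[ B_1 = \left[\begin{array}{c|ccc} \lambda_1 & a_2 & \cdots & a_n \\ \hline 0 & & A_2 & \end{array}\right]. \tag{4.131} \]


Since A is symmetric, $B_1^T = U_1^TA^TU_1 = U_1^TAU_1 = B_1$ so that $B_1$ is symmetric
as well. Thus, from (4.131) $a_2 = a_3 = \dots = a_n = 0$ and $A_2$ must be an (n - 1) x (n - 1)
symmetric matrix. Thus,


\[ B_1 = \begin{bmatrix} \lambda_1 & 0 \\ 0 & A_2 \end{bmatrix}. \tag{4.132} \]
Applying the induction hypothesis to $A_2$, there exists an (n - 1) x (n - 1) orthogonal
matrix $\tilde{U}_2$ such that $\tilde{U}_2^TA_2\tilde{U}_2 = \Lambda_2$ where $\Lambda_2$ is diagonal. Now define $U_2$ in block form
by


\[ U_2 = \begin{bmatrix} 1 & 0 \\ 0 & \tilde{U}_2 \end{bmatrix} \tag{4.133} \]
where $U_2$ is easily shown to be orthogonal. Now
\[ B_1U_2 = \begin{bmatrix} \lambda_1 & 0 \\ 0 & A_2 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & \tilde{U}_2 \end{bmatrix} = \begin{bmatrix} \lambda_1 & 0 \\ 0 & A_2\tilde{U}_2 \end{bmatrix} \tag{4.134} \]
as may be verified by the block multiplication rules of Section 4.4. Thus,
\[ U_2^TB_1U_2 = \begin{bmatrix} 1 & 0 \\ 0 & \tilde{U}_2^T \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 \\ 0 & A_2\tilde{U}_2 \end{bmatrix} = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \tilde{U}_2^TA_2\tilde{U}_2 \end{bmatrix} = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix}. \tag{4.135} \]


Using $B_1 = U_1^TAU_1$ in (4.135)


\[ U_2^TU_1^TAU_1U_2 = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \Lambda_2 \end{bmatrix}. \tag{4.136} \]
Since $U_1U_2$ is an orthogonal matrix, letting $U = U_1U_2$ in (4.136) we find that
\[ U^TAU = \Lambda \tag{4.137} \]


where $\Lambda$ is a diagonal matrix. This gives, as we observed above, the proof of the theorem.
∎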


As our first application of the spectral theorem, we give a further decomposition of a 
symmetric matrix when the matrix is positive definite. 


Definition 4.4 Let A be an n x n symmetric matrix. We say that A is positive definite
if and only if for every $x \in R^n$, $x \neq 0$, $\langle x, Ax\rangle > 0$. If only $\langle x, Ax\rangle \ge 0$, $x \neq 0$, then A is
said to be positive semidefinite.


Theorem 4.12 (Properties of positive definite matrices)
(i) If A is positive definite, then A is invertible.
(ii) A is positive definite if and only if all the eigenvalues of A are positive.


(iii) If A is positive definite, then there exists a nonsingular matrix R such that $A = RR^T$.


Proof. (i) Suppose A is singular. Then there is a nonzero vector $x \in R^n$ such
that Ax = 0. Thus, $\langle x, Ax\rangle = \langle x, 0\rangle = 0$ and this contradicts the assumption that A is
positive definite.

(ii) Suppose A is positive definite and $\lambda$ is an eigenvalue of A with eigenvector x. Then $Ax = \lambda x$, so
that $\langle x, Ax\rangle = \lambda\,\langle x, x\rangle$. Since $x \neq 0$, $\lambda = \langle x, Ax\rangle / \langle x, x\rangle > 0$.

On the other hand, suppose that A is symmetric with positive eigenvalues. Then
by the spectral theorem $A = U\Lambda U^T$ where $\Lambda = \mathrm{diag}(\lambda_i)$ and $\lambda_i$, i = 1, 2, ..., n, are the
eigenvalues of A. Then, if $x \neq 0$,


\[ \langle Ax, x\rangle = \langle U\Lambda U^Tx, x\rangle = \langle \Lambda U^Tx, U^Tx\rangle. \tag{4.138} \]
Let $z = U^Tx$, then $\langle Ax, x\rangle = \langle \Lambda z, z\rangle = \sum_{i=1}^n \lambda_iz_i^2 > 0$, since $\lambda_i > 0$ and $z \neq 0$.
(iii) From the spectral theorem and (ii), $A = U\Lambda U^T$ where again $\Lambda = \mathrm{diag}(\lambda_i)$ and
$\lambda_i > 0$, $1 \le i \le n$, are the eigenvalues of A. Let $\sqrt{\Lambda} = \mathrm{diag}(\sqrt{\lambda_i})$, then
\[ A = U\sqrt{\Lambda}\sqrt{\Lambda}U^T = RR^T \tag{4.139} \]


where $R = U\sqrt{\Lambda}$. ∎
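
For a 2 x 2 symmetric matrix the spectral decomposition can be written in closed form, which gives a quick numerical check of (4.125) and (4.139). The sketch below assumes the standard rotation-angle construction for the matrix [[a, b], [b, d]]; the function name `spectral_2x2` and the example values are illustrative and not taken from the text.

```python
import math

def spectral_2x2(a, b, d):
    """Eigenvalues and orthogonal U for the symmetric matrix [[a, b], [b, d]]."""
    # Rotation angle that makes U^T A U diagonal: tan(2 theta) = 2b / (a - d).
    theta = 0.5 * math.atan2(2 * b, a - d)
    c, s = math.cos(theta), math.sin(theta)
    lam = [a * c * c + 2 * b * c * s + d * s * s,
           a * s * s - 2 * b * c * s + d * c * c]
    U = [[c, -s], [s, c]]   # columns are orthonormal eigenvectors
    return lam, U

# Hypothetical positive definite matrix.
a, b, d = 2.0, 1.0, 2.0
lam, U = spectral_2x2(a, b, d)

# R = U sqrt(Lambda), as in (4.139); requires positive eigenvalues.
R = [[U[i][j] * math.sqrt(lam[j]) for j in range(2)] for i in range(2)]

# R R^T should reproduce A up to rounding error.
RRt = [[sum(R[i][k] * R[j][k] for k in range(2)) for j in range(2)]
       for i in range(2)]
```

Both eigenvalues are positive here, so by part (ii) of the theorem the matrix is positive definite and the square roots in R are well defined.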


4.6.3 Some Further Applications of the Spectral Theorem 


Using the spectral theorem we can establish a number of useful properties of a symmetric
matrix A and its eigenvalues.


Theorem 4.13 If A is an n x n symmetric matrix and $\lambda_i$, $1 \le i \le n$, are the
eigenvalues of A, then
(i) the eigenvalues of $A^p$ are $\lambda_i^p$, $1 \le i \le n$;
(ii) if A is invertible, then the eigenvalues of $A^{-1}$ are $1/\lambda_i$, $1 \le i \le n$;
(iii) $\mathrm{tr}(A) = \sum_{i=1}^n \lambda_i$;
(iv) $\det(A) = \prod_{i=1}^n \lambda_i$.

Proof. (i) From the spectral theorem $A = U\Lambda U^T$ where $\Lambda = \mathrm{diag}(\lambda_i)$. Using the
orthogonality of U it then follows that


\[ A^p = U\Lambda^pU^T. \tag{4.140} \]


Hence, $A^pU = U\Lambda^p$. Thus the columns of U are eigenvectors of $A^p$ with eigenvalues
$\lambda_i^p$, $1 \le i \le n$. Since an n x n matrix has at most n eigenvalues, all the eigenvalues of $A^p$
are of the form $\lambda_i^p$ where $\lambda_i$ is an eigenvalue of A.

(ii) This follows as for (i) using the fact that $A^{-1} = U\Lambda^{-1}U^T$, since $A^{-1}$ is symmetric
and $\Lambda^{-1} = \mathrm{diag}(1/\lambda_i)$.

(iii) Again using the spectral theorem and property (iv) of the trace we get


\[ \mathrm{tr}(A) = \mathrm{tr}(U\Lambda U^T) = \mathrm{tr}(U^TU\Lambda) = \mathrm{tr}(I_n\Lambda) = \mathrm{tr}(\Lambda) = \sum_{i=1}^n \lambda_i. \tag{4.141} \]


From (4.141) and (ii) it follows that $\mathrm{tr}(A^{-1}) = \sum_{i=1}^n 1/\lambda_i$.
(iv) Using the fact that the determinant of a product of matrices is the product of
the determinants


\[ \det(A) = \det(U\Lambda U^T) = \det(U)\det(U^T)\det(\Lambda) = \det(UU^T)\det(\Lambda) = \det(I_n)\det(\Lambda) = \prod_{i=1}^n \lambda_i \tag{4.142} \]


since the determinant of a diagonal matrix is the product of its diagonal elements. ∎


Using Theorem 4.13 we prove the important fact that if A is an orthogonal projection,
then
\[ \mathrm{tr}(A) = \mathrm{rank}(A). \tag{4.143} \]


Since $A^2 = A$ and A is symmetric, it follows from Theorem 4.13 that the eigenvalues of
A satisfy $\lambda_i^2 = \lambda_i$, $1 \le i \le n$. Thus, each $\lambda_i$ is either 0 or 1. Hence,
\[ U^TAU = \Lambda \tag{4.144} \]


where $\Lambda$ is a diagonal matrix with p ones and n - p zeros on the diagonal. Since
multiplication of a matrix by a nonsingular matrix does not change its rank, the rank of A
is the rank of $\Lambda$. But the rank of $\Lambda$ is p since its column space is spanned by the p
columns which correspond to the ones on the diagonal. Thus,


\[ \mathrm{rank}(A) = \mathrm{rank}(\Lambda) = p = \mathrm{tr}(\Lambda) = \mathrm{tr}(A). \tag{4.145} \]
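
The identity (4.143) can be checked on a concrete orthogonal projection. The sketch below builds $A = X(X^TX)^{-1}X^T$, the projection onto the column space of a 3 x 2 matrix X; this construction and the data in X are assumptions made for the example. The rank is computed by exact Gaussian elimination.

```python
from fractions import Fraction as F

def matmul(X, Y):
    """Product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    """Transpose of a matrix stored as a list of rows."""
    return [list(r) for r in zip(*X)]

def inv2(M):
    """Inverse of a 2 x 2 matrix by the adjugate formula."""
    (a, b), (c, d) = M
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def rank(M):
    """Rank by Gaussian elimination in exact arithmetic."""
    M = [row[:] for row in M]
    r = 0
    for col in range(len(M[0])):
        pivot = next((i for i in range(r, len(M)) if M[i][col] != 0), None)
        if pivot is None:
            continue
        M[r], M[pivot] = M[pivot], M[r]
        for i in range(r + 1, len(M)):
            f = M[i][col] / M[r][col]
            M[i] = [x - f * y for x, y in zip(M[i], M[r])]
        r += 1
    return r

# Hypothetical matrix with two linearly independent columns.
X = [[F(1), F(0)], [F(1), F(1)], [F(1), F(2)]]
Xt = transpose(X)

# Orthogonal projection onto the column space of X (symmetric and idempotent).
A = matmul(matmul(X, inv2(matmul(Xt, X))), Xt)
trace = sum(A[i][i] for i in range(3))
```

Here both the trace and the rank equal 2, the number of independent columns of X.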


156 CHAPTER 4. RANDOM VECTORS AND MATRIX ALGEBRA 


4.6.4 Expectation of Quadratic Forms 


As another application of the matrix techniques in this chapter we calculate the expected
value of a quadratic form of a random vector. Such expectations will be quite useful in
the following chapter.

Let $(Y_1, Y_2, \dots, Y_n)$ be n random variables. Collectively we will call them a random
vector Y.


Definition 4.5 Let $Y = (Y_1, Y_2, \dots, Y_n)^T$ be a random vector. The mean vector of
Y, E(Y), is the vector
\[ E(Y) = [E(Y_1), E(Y_2), \dots, E(Y_n)]^T \tag{4.146} \]


provided the expectations exist. Also, the variance-covariance matrix of Y, $\Sigma(Y)$, is the
n x n matrix
\[ \Sigma(Y) = [\mathrm{Cov}(Y_i, Y_j)], \qquad 1 \le i, j \le n. \tag{4.147} \]


Note: The diagonal elements of $\Sigma(Y)$ are $\mathrm{Cov}(Y_i, Y_i) = \mathrm{Var}(Y_i)$.


Definition 4.6 Let $A = [a_{ij}]$, i, j = 1, 2, ..., n, be an n x n symmetric matrix, and let
$Y = (Y_1, Y_2, \dots, Y_n)^T$ be an n x 1 random vector. If $Z = \langle Y, AY\rangle = \sum_{i=1}^n \sum_{j=1}^n a_{ij}Y_iY_j$,
we say that Z is a quadratic form in the $Y_i$'s.


Theorem 4.14 Let $Z = \langle Y, AY\rangle$ be a quadratic form and assume that $E(Y) = \mu$
and $\Sigma(Y) = \Sigma = [\sigma_{ij}]$, i, j = 1, 2, ..., n. Then


\[ E(Z) = \mathrm{tr}(A\Sigma) + \mu^TA\mu. \tag{4.148} \]


Proof. $E(Z) = \sum_{i=1}^n \sum_{j=1}^n a_{ij}E(Y_iY_j)$. Since $\mathrm{Cov}(Y_i, Y_j) = E(Y_iY_j) - E(Y_i)E(Y_j)
= E(Y_iY_j) - \mu_i\mu_j$, we have $E(Y_iY_j) = \mathrm{Cov}(Y_i, Y_j) + \mu_i\mu_j = \sigma_{ij} + \mu_i\mu_j$. Thus,


\[ E(Z) = \sum_{i=1}^n \sum_{j=1}^n a_{ij}\,\mathrm{Cov}(Y_i, Y_j) + \sum_{i=1}^n \sum_{j=1}^n a_{ij}\mu_i\mu_j. \tag{4.149} \]
Now using the definition of the trace and the fact that $\mathrm{tr}(A\Sigma) = \sum_{i=1}^n \sum_{j=1}^n a_{ij}\sigma_{ji}$, we
have $E(Z) = \mathrm{tr}(A\Sigma) + \mu^TA\mu$. ∎
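
Theorem 4.14 holds for any distribution with finite second moments, so it can be verified exactly on a small discrete example. In the sketch below the three outcomes, their probabilities and the matrix A are hypothetical; expectations are computed by direct enumeration.

```python
from fractions import Fraction as F

# Hypothetical discrete distribution of Y = (Y1, Y2): (value, probability).
outcomes = [
    ((F(1), F(0)), F(1, 2)),
    ((F(0), F(2)), F(1, 4)),
    ((F(3), F(1)), F(1, 4)),
]
A = [[F(2), F(1)], [F(1), F(3)]]   # symmetric matrix of the quadratic form
n = 2

def expect(g):
    """Exact expectation of g(Y) by summing over the outcomes."""
    return sum(p * g(y) for y, p in outcomes)

def quad(y):
    """Quadratic form <y, A y>."""
    return sum(A[i][j] * y[i] * y[j] for i in range(n) for j in range(n))

# Mean vector (4.146) and variance-covariance matrix (4.147).
mu = [expect(lambda y, i=i: y[i]) for i in range(n)]
Sigma = [[expect(lambda y, i=i, j=j: y[i] * y[j]) - mu[i] * mu[j]
          for j in range(n)] for i in range(n)]

# Left-hand side of (4.148): E(Z) computed directly.
direct = expect(quad)

# Right-hand side of (4.148): tr(A Sigma) + mu^T A mu.
trace_term = sum(A[i][j] * Sigma[j][i] for i in range(n) for j in range(n))
formula = trace_term + quad(mu)
```

The two quantities agree exactly because the computation uses rational arithmetic.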


We now turn our attention to the application of some of these matrix results to some 
basic results in probability theory, the first being a short discussion of the multivariate 
normal distribution. 


4.7 The Multivariate Normal Distribution 
4.7.1 The Nondegenerate Case 


Definition 4.6 We say that a random vector $Y = (Y_1, Y_2, \dots, Y_n)^T$ has a multivariate
normal density $f_Y(y)$ if and only if


\[ f_Y(y) = \frac{1}{(2\pi)^{n/2}\sqrt{\det\Sigma}} \exp\left[-\tfrac{1}{2}\,\langle y - \mu, \Sigma^{-1}(y - \mu)\rangle\right], \qquad y \in R^n, \tag{4.150} \]
where $y = (y_1, y_2, \dots, y_n)^T \in R^n$, $\mu = (\mu_1, \mu_2, \dots, \mu_n)^T \in R^n$ and $\Sigma$ is a positive-definite
symmetric matrix. It is customary in this case to say that Y has a nondegenerate
multivariate normal distribution. In this case we write


\[ Y \sim N(\mu, \Sigma). \tag{4.151} \]


Theorem 4.15 (i) Let Y have a joint multivariate normal density. Then $f_Y(y)$ as
given by (4.150) is a density. That is, $f_Y(y) \ge 0$ and $\int_{R^n} f_Y(y)\,dy = 1$.


(ii) Y has a joint multivariate normal density if and only if $Y = AZ + \mu$, where A
is n x n and nonsingular and $Z = (Z_1, Z_2, \dots, Z_n)^T$ is a vector of n independent
N(0, 1) random variables; i.e., $Z \sim N(0, I_n)$.


(iii) Let $E(Y) = [E(Y_1), E(Y_2), \dots, E(Y_n)]^T$ be the mean vector of Y and $\Sigma(Y) =
[\mathrm{Cov}(Y_i, Y_j)]$, $1 \le i, j \le n$, be the variance-covariance matrix of Y. Then, $E(Y) = \mu$
and $\Sigma(Y) = \Sigma$.


(iv) The random variables $Y_i$, $1 \le i \le n$, are independent if and only if they are
uncorrelated; i.e., $[\mathrm{Cov}(Y_i, Y_j)] = \mathrm{diag}(\sigma_i^2)$, where $\sigma_i^2 = \mathrm{Var}(Y_i)$.


(v) The moment generating function of Y is given by
\[ M_Y(t) = \exp(\langle t, \mu\rangle)\,\exp(\langle t, \Sigma t\rangle/2) \tag{4.152} \]
where $t = (t_1, t_2, \dots, t_n)^T \in R^n$.
(vi) If $Y \sim N(0, I_n)$ and U is an orthogonal matrix, then $UY \sim N(0, I_n)$.


Note: Theorem 4.15 gives the main properties of jointly distributed normal random
variables that we will need in this text. A more comprehensive discussion may be found
in references such as [40, 63].

Before proving Theorem 4.15 we prove a number of additional properties of random 
vectors. 


Theorem 4.16 Let Y be a random n-vector and let A be an m x n matrix. Then, 
if Z is a random m-vector 


(i) E (AY + Z) = AE (Y) + E (Z). 
(ii) E (Y +b) = X (Y), where b is a constant n-vector. 
(iii) X (AY) = AX (Y) AT. 


Proof. (i) Now the i-th component of $AY + Z$ is $\sum_{j=1}^n a_{ij}Y_j + Z_i$ where $A = [a_{ij}]$, $1 \le i \le m$, $1 \le j \le n$. By linearity of expectation, $E\left(\sum_{j=1}^n a_{ij}Y_j + Z_i\right) = \sum_{j=1}^n a_{ij}E(Y_j) + E(Z_i)$, $1 \le i \le m$. But this is the i-th element of $AE(Y) + E(Z)$.

(ii) The ij-th element of $\Sigma(Y + b)$ is given by

$$\begin{aligned}
\mathrm{Cov}(Y_i + b_i, Y_j + b_j) &= E[(Y_i + b_i)(Y_j + b_j)] - E(Y_i + b_i)E(Y_j + b_j) \\
&= E(Y_iY_j) + b_jE(Y_i) + b_iE(Y_j) + b_ib_j - [E(Y_i) + b_i][E(Y_j) + b_j] \\
&= E(Y_iY_j) + b_jE(Y_i) + b_iE(Y_j) + b_ib_j - E(Y_i)E(Y_j) - b_jE(Y_i) - b_iE(Y_j) - b_ib_j \\
&= E(Y_iY_j) - E(Y_i)E(Y_j) = \mathrm{Cov}(Y_i, Y_j)
\end{aligned} \tag{4.153}$$

which is the ij-th element of $\Sigma(Y)$.
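(iii) Now the i-th element of $X = AY$ is $\sum_{k=1}^n a_{ik}Y_k$. Hence, using (2.74),

$$\mathrm{Cov}(X_i, X_j) = \sum_{l=1}^n \sum_{k=1}^n a_{jk}a_{il}\,\mathrm{Cov}(Y_k, Y_l) = \sum_{l=1}^n \sum_{k=1}^n a_{jk}a_{il}\sigma_{lk}. \tag{4.154}$$

Also,

$$(A\Sigma)_{il} = \sum_{k=1}^n a_{ik}\sigma_{kl} = b_{il}. \tag{4.155}$$

Denoting the lj-th element of $A^T$ by $c_{lj} = a_{jl}$, the ij-th element of $A\Sigma A^T$ is given by

$$\sum_{l=1}^n b_{il}c_{lj} = \sum_{l=1}^n \left(\sum_{k=1}^n a_{ik}\sigma_{kl}\right) a_{jl} = \sum_{l=1}^n \sum_{k=1}^n a_{ik}a_{jl}\sigma_{kl} = \mathrm{Cov}(X_j, X_i) = \mathrm{Cov}(X_i, X_j). \tag{4.156}$$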
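
As a numerical check of Theorem 4.16 (iii), the following sketch compares the matrix product $A\Sigma A^T$ with the double sum in (4.154), entry by entry. The matrices `A` and `Sigma` and all function names are illustrative choices, not taken from the text.

```python
def mat_mul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*A)]


def cov_linear_map(A, Sigma):
    """Return A Sigma A^T, the covariance matrix of AY when Y has covariance Sigma."""
    return mat_mul(mat_mul(A, Sigma), transpose(A))


def cov_double_sum(A, Sigma, i, j):
    """Entry (i, j) of the covariance of AY, written as the double sum in (4.154)."""
    n = len(Sigma)
    return sum(A[j][k] * A[i][l] * Sigma[l][k]
               for l in range(n) for k in range(n))


# Illustrative 2 x 3 matrix A and 3 x 3 symmetric covariance matrix Sigma.
A = [[1, 2, 0], [0, 1, -1]]
Sigma = [[2, 1, 0], [1, 3, 1], [0, 1, 4]]
C = cov_linear_map(A, Sigma)
print(C)
```

Both routes give the same 2 x 2 matrix, which is the content of (4.156).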


Proof of Theorem 4.15. (i) Obviously $f_Y(y) > 0$, so to show that $f_Y(y)$ is a
density, it suffices to show that it integrates to one. To do this use Theorem 4.12 to write
$\Sigma = RR^T$. Then,

$$\langle y - \mu, (RR^T)^{-1}(y - \mu) \rangle = \langle R^{-1}(y - \mu), R^{-1}(y - \mu) \rangle \tag{4.157}$$

and

$$\sqrt{\det \Sigma} = \sqrt{\det(RR^T)} = |\det R|. \tag{4.158}$$

Let $z = R^{-1}(y - \mu)$. Using the change of variables formula for multiple integrals [40],

$$\int_{\mathbb{R}^n} f_Y(y)\, dy = \int_{\mathbb{R}^n} f_Y(Rz + \mu)\, |\det R|\, dz = \int_{\mathbb{R}^n} \frac{\exp[-\langle z, z \rangle / 2]}{(2\pi)^{n/2} |\det R|}\, |\det R|\, dz = \int_{\mathbb{R}^n} \frac{\exp[-\langle z, z \rangle / 2]}{(2\pi)^{n/2}}\, dz. \tag{4.159}$$

But $\langle z, z \rangle = \sum_{i=1}^n z_i^2$, so that

$$\int_{\mathbb{R}^n} \frac{\exp[-\langle z, z \rangle / 2]}{(2\pi)^{n/2}}\, dz = \prod_{i=1}^n \int_{-\infty}^{\infty} \frac{\exp(-z_i^2 / 2)}{(2\pi)^{1/2}}\, dz_i = 1, \tag{4.160}$$

since each of the integrals in (4.160) is the integral of a standard normal density.

(ii) Making use of arguments similar to those in (i) we find that $Y = RZ + \mu$ where
$\Sigma = RR^T$ and $Z \sim N(0, I_n)$. The details are left as an exercise.

(iii) From (ii), $E(Y) = E(RZ + \mu) = RE(Z) + E(\mu) = \mu$, since $E(Z) = 0$. Also,
from Theorem 4.16,

$$\Sigma(Y) = \Sigma(RZ + \mu) = \Sigma(RZ) = R\Sigma(Z)R^T = RI_nR^T = RR^T = \Sigma. \tag{4.161}$$


| 


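(iv) Suppose that $(Y_1, Y_2, \ldots, Y_n)^T$ are $N(\mu, \Sigma)$ and are independent. Then

$$\mathrm{Cov}(Y_i, Y_j) = \begin{cases} \sigma_i^2, & i = j \\ 0, & i \ne j. \end{cases} \tag{4.162}$$

On the other hand, if $Y_i$, $1 \le i \le n$, are uncorrelated, then since $\Sigma$ is the variance-covariance
matrix of Y we have $\Sigma = \mathrm{diag}(\sigma_i^2)$, $\Sigma^{-1} = \mathrm{diag}(1/\sigma_i^2)$ and

$$f_Y(y) = \frac{\exp[-\langle y - \mu, \Sigma^{-1}(y - \mu) \rangle / 2]}{(2\pi)^{n/2} \prod_{i=1}^n \sigma_i} = \prod_{i=1}^n \frac{\exp[-(y_i - \mu_i)^2 / (2\sigma_i^2)]}{(2\pi)^{1/2} \sigma_i}. \tag{4.163}$$

Since the joint density of Y factors as the product of n $N(\mu_i, \sigma_i^2)$ densities, it follows that
$Y_i$, $1 \le i \le n$, are independent $N(\mu_i, \sigma_i^2)$ random variables.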
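(v) By definition, the moment generating function of Y is

$$M_Y(t) = E\left[e^{\langle t, Y \rangle}\right]. \tag{4.164}$$

By (ii), $Y = RZ + \mu$ where $RR^T = \Sigma$, so that

$$\langle t, Y \rangle = \langle t, RZ \rangle + \langle t, \mu \rangle = \langle R^Tt, Z \rangle + \langle t, \mu \rangle = \langle s, Z \rangle + \langle t, \mu \rangle \tag{4.165}$$

where $s = R^Tt$. Thus,

$$M_Y(t) = e^{\langle t, \mu \rangle} E\left(e^{\langle s, Z \rangle}\right). \tag{4.166}$$

But $\langle s, Z \rangle = \sum_{i=1}^n s_iZ_i$ so that $E(e^{\langle s, Z \rangle}) = \prod_{i=1}^n E(e^{s_iZ_i})$, where $Z_i$, $1 \le i \le n$, are
independent N(0,1) random variables.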




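From Section 2.7,

$$E\left(e^{\langle s, Z \rangle}\right) = \prod_{i=1}^n e^{s_i^2/2} = \exp\left(\sum_{i=1}^n s_i^2 / 2\right) = \exp(\langle s, s \rangle / 2) = \exp(\langle R^Tt, R^Tt \rangle / 2) = \exp(\langle t, RR^Tt \rangle / 2) = \exp(\langle t, \Sigma t \rangle / 2). \tag{4.167}$$

Combining (4.166) and (4.167) we find that

$$M_Y(t) = \exp(\langle t, \mu \rangle) \exp(\langle t, \Sigma t \rangle / 2). \tag{4.168}$$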


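(vi) We establish (vi) by calculating the moment generating function of $W = UY$.
From (v), $M_Y(t) = e^{\langle t, t \rangle / 2}$, since $\Sigma(Y) = I_n$. Thus,

$$M_W(t) = E\left(e^{\langle t, UY \rangle}\right) = E\left(e^{\langle U^Tt, Y \rangle}\right) = e^{\langle U^Tt, U^Tt \rangle / 2} = e^{\langle t, UU^Tt \rangle / 2} = e^{\langle t, t \rangle / 2}, \tag{4.169}$$

since $UU^T = I_n$. Thus W has the same moment generating function as Y and so W has
the same joint distribution as Y. Thus $W = (W_1, W_2, \ldots, W_n)^T$ are independent N(0,1)
random variables. ■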
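
The representation $Y = RZ + \mu$ from Theorem 4.15 (ii) also gives a practical way to simulate multivariate normal vectors. The sketch below is a minimal illustration under assumed values: the mean `mu`, the factor `R` and the sample size are arbitrary choices, and the function names are hypothetical. It draws samples and compares the sample mean and covariance with $\mu$ and $\Sigma = RR^T$.

```python
import random


def sample_mvn(mu, R, rng):
    """Draw one sample Y = R Z + mu, with Z a vector of independent N(0, 1) draws."""
    z = [rng.gauss(0.0, 1.0) for _ in range(len(R[0]))]
    return [sum(R[i][k] * z[k] for k in range(len(z))) + mu[i]
            for i in range(len(mu))]


def sample_mean_cov(samples):
    """Return the sample mean vector and the sample covariance matrix."""
    n, d = len(samples), len(samples[0])
    mean = [sum(s[i] for s in samples) / n for i in range(d)]
    cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov


rng = random.Random(0)              # fixed seed so the run is reproducible
mu = [1.0, -2.0]                    # assumed mean vector
R = [[2.0, 0.0], [1.0, 1.0]]        # assumed factor; Sigma = R R^T = [[4, 2], [2, 2]]
samples = [sample_mvn(mu, R, rng) for _ in range(20000)]
mean, cov = sample_mean_cov(samples)
print(mean)
print(cov)
```

With this many draws the sample mean should be close to (1, -2) and the sample covariance close to [[4, 2], [2, 2]], in line with part (iii) of the theorem.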


4.7.2 The Degenerate Multivariate Normal Distribution 


In order to develop the distribution theory associated with multiple regression models 
with normal errors, it is necessary to generalize the multivariate normal distribution to 
the case where X is singular. In this case Y will not have a density so (4.150) will 
not apply. However, if we use the well known fact from probability theory that the 
distribution of many random variables may be characterized in terms of their moment 
generating functions, then we can use (4.152) to define joint multivariate normal distributions
with singular variance-covariance matrices. Such distributions are usually said to
be degenerate since they do not have densities. 


Definition 4.7 Let $Y = (Y_1, Y_2, \ldots, Y_n)^T$ be an n-vector of random variables. We say
that Y has a degenerate multivariate normal distribution if and only if the joint moment
generating function of Y is given by

$$M_Y(t) = \exp(\langle t, \mu \rangle + \langle t, \Sigma t \rangle / 2), \tag{4.170}$$

where $\Sigma$ is a positive semidefinite matrix.
As before, we write $Y \sim N(\mu, \Sigma)$.


Theorem 4.17 Assume that Y has a degenerate $N(\mu, \Sigma)$ distribution. Then,
(i) $Y = RZ + \mu$, where $\Sigma = RR^T$ and Z is $N(0, I_n)$.
(ii) $E(Y) = \mu$ and $\Sigma(Y) = \Sigma$.
(iii) If $\mathrm{Var}(Y_i) = \sigma_i^2 > 0$, then $Y_i$ is $N(\mu_i, \sigma_i^2)$. Otherwise $Y_i$ is a degenerate random
variable with $P(Y_i = \mu_i) = 1$.
(iv) $Y_i$, $1 \le i \le n$, are independent random variables if and only if they are uncorrelated.


Proof. (i) To prove this we find the moment generating function of Y. Straightforward
manipulations as in Theorem 4.15 show that the moment generating function of
$RZ + \mu$ is given by (4.170). Thus, $RZ + \mu$ and Y have the same distribution.

(ii) This follows as in Theorem 4.15.

(iii) The marginal distribution of $Y_i$ is given by setting $t_j = 0$, $j \ne i$, in (4.170). In this
case, $\langle t, \Sigma t \rangle = \sigma_i^2 t_i^2$, where $\sigma_i^2$ is the i-th diagonal element of $\Sigma$, and $\langle t, \mu \rangle = \mu_i t_i$, $1 \le i \le n$.
Thus,

$$M_{Y_i}(t_i) = e^{\mu_i t_i} e^{\sigma_i^2 t_i^2 / 2}. \tag{4.171}$$

If $\sigma_i^2 > 0$, then $Y_i$ is $N(\mu_i, \sigma_i^2)$. On the other hand, if $\sigma_i^2 = 0$ then

$$M_{Y_i}(t_i) = e^{\mu_i t_i} \tag{4.172}$$

and this is the moment generating function of a degenerate random variable $D_i$ with
$P(D_i = \mu_i) = 1$. Then $Y_i$ has the same distribution as $D_i$.

(iv) Using (iii) this follows as in Theorem 4.15. ■
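
A small sketch may help make the degenerate case concrete. Assuming a singular factor `R` (an arbitrary choice made for illustration), $\Sigma = RR^T$ has determinant zero, so no density of the form (4.150) exists, yet $Y = RZ + \mu$ is still well defined; every draw lies on a line in the plane.

```python
import random


def gram(R):
    """Return Sigma = R R^T for a matrix R given as a list of rows."""
    return [[sum(ri[k] * rj[k] for k in range(len(ri))) for rj in R] for ri in R]


R = [[1.0, 0.0], [2.0, 0.0]]        # singular 2 x 2 factor (assumed for illustration)
mu = [0.0, 3.0]
Sigma = gram(R)                     # [[1, 2], [2, 4]]
det_Sigma = Sigma[0][0] * Sigma[1][1] - Sigma[0][1] * Sigma[1][0]

rng = random.Random(1)
draws = []
for _ in range(5):
    z = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    y = [sum(R[i][k] * z[k] for k in range(2)) + mu[i] for i in range(2)]
    draws.append(y)

# Every draw satisfies y2 - 3 = 2 * y1: the distribution lives on a line.
residuals = [y[1] - 3.0 - 2.0 * y[0] for y in draws]
print(det_Sigma, residuals)
```

Here both marginal variances are positive, so by Theorem 4.17 (iii) each component is still normally distributed even though the joint distribution has no density.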


4.8 Solving Systems of Equations 


As we have seen in Chapter 3, the computation of parameter estimates in the simple
linear regression model requires the solution of two linear equations in two unknowns.
For the models to be treated in subsequent chapters involving m > 2 parameters this
will require the solution of m equations in m unknowns, which can be tedious, even for
moderate values of m. Although most modern statistical packages treat this computation
as a “black box”, we feel it is useful for students to see how this can be done and perhaps
write their own programs. (It is the authors’ experience that these black boxes do not
always work for all problems and one may sometimes have to do the computation oneself.)
As we observed at the beginning of the chapter, the set of m equations in m unknowns
$(x_1, x_2, \ldots, x_m)$,

$$\sum_{j=1}^m a_{ij}x_j = b_i, \quad 1 \le i \le m, \tag{4.173}$$

can be written in matrix-vector form as

$$Ax = b \tag{4.174}$$

where $A = [a_{ij}]$, $1 \le i, j \le m$, is the coefficient matrix, $x = (x_1, x_2, \ldots, x_m)^T$ and
$b = (b_1, b_2, \ldots, b_m)^T$. Now (4.174) has a unique solution if and only if A is invertible and
then

$$x = A^{-1}b \tag{4.175}$$

solves (4.174). Thus, if we can compute $A^{-1}$, then (4.175) gives us the required solution
x. Unfortunately, computing $A^{-1}$ is usually a more difficult task than solving (4.174)
directly, so one does not usually proceed in this fashion numerically. For small values of
m it is possible to use Cramer's rule [112], but again, this is usually not computationally
desirable for most practical problems.
Historically, the most widely used technique for solving (4.174) has been Gaussian
elimination, which is a generalization of the well known elimination method taught in high
school algebra. This method is still widely used, for both statistical and nonstatistical
problems [112, 120], and is generally quite efficient and accurate for moderate values of
m, provided A is well-conditioned (see Section 4.9 for a definition), and can be readily
programmed. Because the coefficient matrices in regression analysis are of the form

$$A = X^TX \quad (X \text{ is } n \times m) \tag{4.176}$$

they will be positive definite if X has full rank. In this case it is often useful to exploit
this property to produce more efficient and stable algorithms and we shall discuss several
of these as well.


4.8.1 Gaussian Elimination 


With Gaussian elimination, as well as many other techniques, the goal is to reduce the
system (4.174) to an equivalent triangular system which can be solved by backward or
forward substitution. For example, if

$$A = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ 0 & a_{22} & a_{23} \\ 0 & 0 & a_{33} \end{pmatrix}, \tag{4.177}$$

then (4.174) becomes

(1) $a_{11}x_1 + a_{12}x_2 + a_{13}x_3 = b_1$,
(2) $a_{22}x_2 + a_{23}x_3 = b_2$,
(3) $a_{33}x_3 = b_3$.

One then proceeds by solving (3) for $x_3$, substituting in (2) and solving for $x_2$, and finally
substituting $x_3$, $x_2$ into (1) to get $x_1$. Clearly, this process can be generalized when A is
upper triangular:

$$A = \begin{pmatrix} a_{11} & a_{12} & \cdots & a_{1m} \\ 0 & a_{22} & \cdots & a_{2m} \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & a_{mm} \end{pmatrix}. \tag{4.178}$$
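
The back substitution just described can be sketched in a few lines. This is an illustrative implementation only; the function name and the test system are assumptions made here, not taken from the text.

```python
def back_substitute(U, b):
    """Solve U x = b for an upper triangular matrix U with non-zero diagonal."""
    m = len(b)
    x = [0.0] * m
    # Work from the last equation upwards, as in steps (3), (2), (1) above.
    for i in range(m - 1, -1, -1):
        if U[i][i] == 0:
            raise ValueError("zero diagonal entry: no unique solution")
        known = sum(U[i][j] * x[j] for j in range(i + 1, m))
        x[i] = (b[i] - known) / U[i][i]
    return x


# Illustrative upper triangular system with solution (1, 3, 2).
U = [[2.0, 1.0, -1.0],
     [0.0, 3.0, 2.0],
     [0.0, 0.0, 4.0]]
b = [3.0, 13.0, 8.0]
x = back_substitute(U, b)
print(x)
```

Forward substitution for a lower triangular system is the same loop run from the first equation downwards.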


In Gaussian elimination we first reduce A in (4.174) to the form in (4.178) and then 
the resulting system is solved by back substitution. The reduction is carried out by 
successively eliminating variables in all equations below the first. For simplicity, we 
illustrate this process for m = 3. 

Hence, we consider the following equations in three unknowns,

(1) $a_{11}x_1 + a_{12}x_2 + a_{13}x_3 = b_1$,
(2) $a_{21}x_1 + a_{22}x_2 + a_{23}x_3 = b_2$,
(3) $a_{31}x_1 + a_{32}x_2 + a_{33}x_3 = b_3$.

We begin by trying to eliminate the variable $x_1$ in equations (2) and (3). We do this
by adding suitable multiples of (1) to (2), and then to (3). To eliminate $x_1$ in (2) we
must make $a_{21} = 0$. Thus we need to determine c so that $a_{21} + ca_{11} = 0$. If $a_{11} \ne 0$
(which we assume) then $c = -a_{21}/a_{11}$. Thus adding c x (1) to (2) the second equation
becomes

$$(2')\quad 0 \cdot x_1 + \left[a_{22} - \left(\frac{a_{21}}{a_{11}}\right)a_{12}\right]x_2 + \left[a_{23} - \left(\frac{a_{21}}{a_{11}}\right)a_{13}\right]x_3 = b_2 - \left(\frac{a_{21}}{a_{11}}\right)b_1. \tag{4.179}$$

Similarly, eliminating $x_1$ from (3) gives

$$(3')\quad 0 \cdot x_1 + \left[a_{32} - \left(\frac{a_{31}}{a_{11}}\right)a_{12}\right]x_2 + \left[a_{33} - \left(\frac{a_{31}}{a_{11}}\right)a_{13}\right]x_3 = b_3 - \left(\frac{a_{31}}{a_{11}}\right)b_1. \tag{4.180}$$

The transformed equations now take the form

(1) $a_{11}x_1 + a_{12}x_2 + a_{13}x_3 = b_1$,
(2') $a'_{22}x_2 + a'_{23}x_3 = b'_2$,
(3') $a'_{32}x_2 + a'_{33}x_3 = b'_3$,

where $a'_{22}$, $a'_{23}$ etc. are determined as in (4.179)-(4.180). This procedure may now
be repeated, eliminating $x_2$ from the last equation by adding the multiple $-a'_{32}/a'_{22}$ of
(2') to (3'). This results in the system

(1) $a_{11}x_1 + a_{12}x_2 + a_{13}x_3 = b_1$,
(2') $a'_{22}x_2 + a'_{23}x_3 = b'_2$,
(3'') $a''_{33}x_3 = b''_3$.

If the original equations have a unique solution for every right hand side, this procedure
always works theoretically, subject to the following proviso. Note that at each
stage we had to divide by the pivot elements on the diagonal, so these elements need
to be non-zero. This may not always be the case, but it can always be accomplished by a
row interchange because of the unique solvability of the equations. Thus, an additional
ingredient of the algorithm is a mechanism to search for a non-zero
pivot element at each stage. One of the nice features of the least squares equations
is that because $X^TX$ is positive definite, successive pivot elements
are guaranteed to be non-zero. However, in many problems some of the pivots can become very small and
then dividing by these can produce substantial round-off error. Hence, one often uses a
partial pivoting strategy of searching the elements in the pivot column below the
current pivot $a_{ii}$ for the largest element (in absolute value) and then interchanging that
row with the i-th.


Example 4.2 To illustrate some of these ideas we solve the following equations by
Gaussian elimination:

(1) $2x_1 + x_2 + x_3 = 4$,
(2) $x_1 + x_2 + x_3 = 4$,
(3) $3x_1 + 2x_2 + x_3 = 6$.

Eliminating $x_1$ in (2) and (3) gives

(1') $2x_1 + x_2 + x_3 = 4$,
(2') $x_2/2 + x_3/2 = 2$,
(3') $x_2/2 - x_3/2 = 0$.

Then, eliminating $x_2$ in (3') we get

(1'') $2x_1 + x_2 + x_3 = 4$,
(2'') $x_2/2 + x_3/2 = 2$,
(3'') $-x_3 = -2$.

Solving these by back substitution gives: $x_3 = 2$; $x_2/2 = 2 - 1$ or $x_2 = 2$; and
$2x_1 = 4 - 2 - 2 = 0$ or $x_1 = 0$. Hence, the solution is: $x_1 = 0$, $x_2 = 2$, $x_3 = 2$.


Example 4.3 Solve the following system of equations by Gaussian elimination.

(1) $x_2 + x_3 = 2$,
(2) $x_1 + x_2 = 2$,
(3) $x_1 + x_3 = 2$.

Here $a_{11}$ is zero, so we begin by interchanging (1) and (2) to get

(1') $x_1 + x_2 = 2$,
(2') $x_2 + x_3 = 2$,
(3') $x_1 + x_3 = 2$.

Eliminating $x_1$ in (3') by subtracting (1') from (3') gives

(1'') $x_1 + x_2 = 2$,
(2'') $x_2 + x_3 = 2$,
(3'') $-x_2 + x_3 = 0$.

Adding (2'') to (3'') gives $2x_3 = 2$, so $x_3 = 1$; back substitution into (2'')
gives $x_2 = 1$; and putting $x_2$ into (1'') we obtain $x_1 = 1$. Hence, the solution
is: $x_1 = 1$, $x_2 = 1$, $x_3 = 1$.
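
The elimination procedure, including the partial pivoting strategy described earlier, can be sketched as follows. This is an illustrative implementation, not the authors' program; it uses exact rational arithmetic (`fractions.Fraction`) so that the results can be compared directly with the hand calculations in Examples 4.2 and 4.3. The function name is an assumption.

```python
from fractions import Fraction


def gauss_solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    # Augmented matrix [A | b] with exact rational entries.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(m):
        # Partial pivoting: pick the row (at or below the diagonal) whose
        # entry in this column is largest in absolute value.
        pivot = max(range(col, m), key=lambda r: abs(M[r][col]))
        if M[pivot][col] == 0:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        # Add a multiple of the pivot row to each row below it.
        for r in range(col + 1, m):
            c = -M[r][col] / M[col][col]
            M[r] = [mr + c * mc for mr, mc in zip(M[r], M[col])]
    # Back substitution on the resulting upper triangular system.
    x = [Fraction(0)] * m
    for i in range(m - 1, -1, -1):
        known = sum(M[i][j] * x[j] for j in range(i + 1, m))
        x[i] = (M[i][m] - known) / M[i][i]
    return x


ex42 = gauss_solve([[2, 1, 1], [1, 1, 1], [3, 2, 1]], [4, 4, 6])
ex43 = gauss_solve([[0, 1, 1], [1, 1, 0], [1, 0, 1]], [2, 2, 2])
print(ex42, ex43)
```

Because of pivoting, the intermediate rows differ from those in the hand calculations (in the first system the third equation is used as the first pivot row), but the solutions agree: (0, 2, 2) and (1, 1, 1).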


For numerical purposes manipulations with the unknowns $\{x_i\}$ are superfluous. All
the calculations can be done by performing the successive operations on the augmented
coefficient matrix

$$[A|b] = \left[\begin{array}{cccc|c} a_{11} & a_{12} & \cdots & a_{1m} & b_1 \\ a_{21} & a_{22} & \cdots & a_{2m} & b_2 \\ \vdots & \vdots & & \vdots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mm} & b_m \end{array}\right]. \tag{4.181}$$

The elimination steps are performed on $[A|b]$ until it becomes (after m - 1 steps)

$$\left[\begin{array}{cccc|c} \tilde{a}_{11} & \tilde{a}_{12} & \cdots & \tilde{a}_{1m} & \tilde{b}_1 \\ 0 & \tilde{a}_{22} & \cdots & \tilde{a}_{2m} & \tilde{b}_2 \\ \vdots & & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & \tilde{a}_{mm} & \tilde{b}_m \end{array}\right] \tag{4.182}$$

and then the resulting equations are solved by back substitution.
An important consequence of the Gaussian elimination process is that it is equivalent
to factoring A as [112]

$$A = LU \tag{4.183}$$

where L is a lower triangular matrix formed from the multipliers used to do the elimination
and U is the coefficient matrix of the final upper triangular system. Such matrix
factorizations are fundamental in modern numerical analysis.


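Example 4.4 If we consider the system in Example 4.2 we find the matrix of
multipliers is given by

$$L = \begin{pmatrix} 1 & 0 & 0 \\ 1/2 & 1 & 0 \\ 3/2 & 1 & 1 \end{pmatrix}$$

and

$$U = \begin{pmatrix} 2 & 1 & 1 \\ 0 & 1/2 & 1/2 \\ 0 & 0 & -1 \end{pmatrix}.$$

Straightforward matrix multiplication shows that A = LU.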
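
The factors in Example 4.4 can be reproduced by recording the multipliers during elimination. The sketch below is illustrative and makes two simplifying assumptions: no row interchanges are performed (so it fails on a zero pivot), and exact rational arithmetic is used so the output matches the fractions above.

```python
from fractions import Fraction


def lu_factor(A):
    """Return (L, U) with A = L U, L unit lower triangular, without pivoting."""
    m = len(A)
    U = [[Fraction(v) for v in row] for row in A]
    L = [[Fraction(int(i == j)) for j in range(m)] for i in range(m)]
    for col in range(m):
        if U[col][col] == 0:
            raise ValueError("zero pivot: a row interchange would be needed")
        for r in range(col + 1, m):
            mult = U[r][col] / U[col][col]    # multiplier used to eliminate
            L[r][col] = mult                  # store it in L
            U[r] = [ur - mult * uc for ur, uc in zip(U[r], U[col])]
    return L, U


A = [[2, 1, 1], [1, 1, 1], [3, 2, 1]]         # coefficient matrix of Example 4.2
L, U = lu_factor(A)
product = [[sum(L[i][k] * U[k][j] for k in range(3)) for j in range(3)]
           for i in range(3)]
print(L)
print(U)
```

Note that the entries of L are the ratios by which the pivot row is scaled and subtracted, i.e. the negatives of the constants c added in the elimination steps.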


Computing $A^{-1}$

Because regression analysis requires not only parameter estimates but also standard errors and
correlations, we need to obtain $A^{-1}$ as well. This can also be done using Gaussian
elimination or, equivalently, the LU factorization.

If A is an m x m invertible matrix, then $A^{-1}$ is the unique matrix solving

$$AA^{-1} = I_m. \tag{4.184}$$

If we write

$$A^{-1} = [x_1 | x_2 | \ldots | x_m] \tag{4.185}$$

then (4.184) becomes

$$[Ax_1 | Ax_2 | \ldots | Ax_m] = I_m. \tag{4.186}$$

Since the i-th column of $I_m$ is the i-th canonical basis element $e_i$, (4.186) is equivalent to
solving the m systems of equations

$$Ax_i = e_i, \quad 1 \le i \le m. \tag{4.187}$$

Again, these systems can be solved by Gaussian elimination by performing elimination
on the matrix

$$[A | I_m]. \tag{4.188}$$

Equivalently, if one has the LU decomposition, then (4.187) becomes

$$LUx_i = e_i, \quad 1 \le i \le m. \tag{4.189}$$

These can be solved by setting $z_i = Ux_i$, $1 \le i \le m$, and solving for $z_i$ from

$$Lz_i = e_i \tag{4.190}$$

and then $x_i$ is obtained from

$$Ux_i = z_i. \tag{4.191}$$

This is easily done by forward substitution in (4.190) and backward substitution in
(4.191).
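
Elimination on the augmented matrix $[A | I_m]$ of (4.188) can be sketched as below. As an assumption made for brevity, this version uses the Gauss-Jordan variant, which also eliminates above each pivot so that no separate back substitution is needed; the right-hand block then holds $A^{-1}$ directly. Exact rational arithmetic keeps the check $AA^{-1} = I_m$ free of round-off.

```python
from fractions import Fraction


def invert(A):
    """Invert a square matrix by Gauss-Jordan elimination on [A | I]."""
    m = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(m)]
         for i, row in enumerate(A)]
    for col in range(m):
        # Find a row with a non-zero entry in this column to use as pivot.
        pivot = next((r for r in range(col, m) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        M[col], M[pivot] = M[pivot], M[col]
        p = M[col][col]
        M[col] = [v / p for v in M[col]]          # scale pivot row to have pivot 1
        for r in range(m):
            if r != col and M[r][col] != 0:
                c = M[r][col]
                M[r] = [vr - c * vc for vr, vc in zip(M[r], M[col])]
    return [row[m:] for row in M]                 # right-hand block is the inverse


A = [[2, 1, 1], [1, 1, 1], [3, 2, 1]]
A_inv = invert(A)
check = [[sum(A[i][k] * A_inv[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
print(A_inv)
```

For large problems one would normally factor A once and reuse the factors for each right-hand side $e_i$, as in (4.189)-(4.191), rather than eliminate on the full augmented matrix.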
4.8.2 Cholesky Factorization


As we observed above, Gaussian elimination does not use any properties of A other than 
its invertibility. When A is positive definite, other factorizations may be more convenient 
and numerically less prone to round-off error. One of the most important of these is the 
Cholesky factorization 

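$$A = LL^T \tag{4.192}$$

where L is a lower triangular matrix. This factorization can be obtained by assuming L
in the form

$$L = \begin{pmatrix} l_{11} & 0 & \cdots & 0 \\ l_{21} & l_{22} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ l_{m1} & l_{m2} & \cdots & l_{mm} \end{pmatrix} \tag{4.193}$$

and then using (4.192) to solve explicitly for $l_{ij}$. We illustrate this process for a 3 x 3
matrix. (Note: Because A is symmetric we only need the lower triangle of $LL^T$.)
Now,

$$LL^T = \begin{pmatrix} l_{11} & 0 & 0 \\ l_{21} & l_{22} & 0 \\ l_{31} & l_{32} & l_{33} \end{pmatrix} \begin{pmatrix} l_{11} & l_{21} & l_{31} \\ 0 & l_{22} & l_{32} \\ 0 & 0 & l_{33} \end{pmatrix} = \begin{pmatrix} l_{11}^2 & l_{11}l_{21} & l_{11}l_{31} \\ l_{21}l_{11} & l_{21}^2 + l_{22}^2 & l_{21}l_{31} + l_{22}l_{32} \\ l_{31}l_{11} & l_{31}l_{21} + l_{32}l_{22} & l_{31}^2 + l_{32}^2 + l_{33}^2 \end{pmatrix}. \tag{4.194}$$

Equating elements in (4.194) gives $l_{11}^2 = a_{11}$ so that $l_{11} = \sqrt{a_{11}}$. (This makes sense
since the diagonal elements of a positive definite matrix are positive. In fact, $a_{ii} = \langle e_i, Ae_i \rangle > 0$.) Equating elements in the second row gives

$$l_{21}l_{11} = a_{21} \quad \text{and} \quad l_{21}^2 + l_{22}^2 = a_{22}. \tag{4.195}$$

Thus, $l_{21} = a_{21}/l_{11}$ and $l_{22}^2 = a_{22} - a_{21}^2/a_{11} = (a_{11}a_{22} - a_{21}^2)/a_{11}$. It follows from the
positive definiteness of A that $a_{11}a_{22} - a_{21}^2 > 0$, so that $l_{22} = \sqrt{(a_{11}a_{22} - a_{21}^2)/a_{11}}$.
Last, equating elements in the third row gives

$$l_{31}l_{11} = a_{31}, \quad l_{31}l_{21} + l_{32}l_{22} = a_{32} \quad \text{and} \quad l_{31}^2 + l_{32}^2 + l_{33}^2 = a_{33} \tag{4.196}$$

and these can be solved successively for $l_{31}$, $l_{32}$ and $l_{33}$.
For a general m x m positive definite symmetric matrix this process can be extended,
giving the i-th row $l_{ij}$, $j = 1, 2, \ldots, i - 1$, and the diagonal element $l_{ii}$ as [112]

$$l_{ij} = \frac{1}{l_{jj}}\left(a_{ij} - \sum_{k=1}^{j-1} l_{ik}l_{jk}\right), \qquad l_{ii} = \left(a_{ii} - \sum_{k=1}^{i-1} l_{ik}^2\right)^{1/2}. \tag{4.197}$$

As with the LU factorization, the Cholesky factorization can be used to solve (4.174).
In this case

$$LL^Tx = b \tag{4.198}$$

and setting $z = L^Tx$ gives $Lz = b$. As before, z can be obtained by forward substitution and
x by back substitution. The inverse of A can be found as for the LU decomposition by
solving

$$LL^Tx_i = e_i, \quad 1 \le i \le m. \tag{4.199}$$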


The Cholesky factorization is important not only for solving systems of equations;
it may be needed to determine other quantities as well. For example, in many statistical
calculations one requires the determinant of A. The Cholesky factorization
gives a convenient way of computing it. In fact, $\det(A) = \det(LL^T) = \det^2(L)$ since
$\det(L) = \det(L^T)$. But the determinant of a triangular matrix is the product of its
diagonal elements, so that $\det^2(L) = \prod_{i=1}^m l_{ii}^2 > 0$. Historically, $\det(A)$ was often used as
a measure of conditioning of A, small values indicating ill-conditioning, so the Cholesky
factorization also gives a convenient way of assessing this. We shall return to this topic in Chapter 9.
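
The row-by-row construction of L, the two triangular solves, and the determinant formula can be collected into a short sketch. The function names and the test matrix are assumptions chosen for illustration; the matrix was picked so that L has integer entries.

```python
import math


def cholesky(A):
    """Return lower triangular L with A = L L^T for a positive definite A."""
    m = len(A)
    L = [[0.0] * m for _ in range(m)]
    for i in range(m):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = A[i][i] - s
                if d <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i][i] = math.sqrt(d)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def cholesky_solve(L, b):
    """Solve L L^T x = b: forward substitution for z, then back substitution for x."""
    m = len(b)
    z = [0.0] * m
    for i in range(m):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * m
    for i in range(m - 1, -1, -1):
        # Row i of L^T is column i of L.
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, m))) / L[i][i]
    return x


def det_from_cholesky(L):
    """det(A) as the product of the squared diagonal elements of L."""
    return math.prod(L[i][i] ** 2 for i in range(len(L)))


A = [[4.0, 2.0, 2.0], [2.0, 5.0, 3.0], [2.0, 3.0, 6.0]]   # illustrative matrix
L = cholesky(A)
x = cholesky_solve(L, [8.0, 10.0, 11.0])                   # right side is A (1, 1, 1)^T
det_A = det_from_cholesky(L)
print(L, x, det_A)
```

For this matrix the factor has rows (2, 0, 0), (1, 2, 0), (1, 1, 2), the solution is (1, 1, 1) and the determinant is 64.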


4.9 The Singular Value Decomposition 


As we have already seen, various factorizations of a square matrix A are important for 
computing a variety of quantities and for a theoretical understanding of properties of A. 
It is of interest to know if similar decompositions exist for rectangular matrices, since the 
design matrices X in regression analysis are generally not square. A factorization which 
is playing an increasingly important role in scientific calculations is the singular value 
decomposition (SVD), which may be viewed as a generalization of the spectral theorem 
for square matrices. 


Theorem 4.18 (Singular value decomposition) Let A be a real n x m matrix. Then
there exist two orthogonal matrices U, V such that

$$V^TAU = F \tag{4.200}$$

where F is a diagonal rectangular n x m matrix of the form

$$F = \begin{pmatrix} \mathrm{diag}(\mu_1, \ldots, \mu_r) & 0 \\ 0 & 0 \end{pmatrix} \tag{4.201}$$

where $F_{ii} = \mu_i$, $i = 1, 2, \ldots, r$. The numbers $\mu_i$, $1 \le i \le r$, are called the singular values
of A. They are real and positive and can be arranged so that $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_r > 0$,
where r is the rank of A.


Proof. We give an outline of the proof; full details can be found in [112, 103]. Consider
the square m x m matrix $A^TA$. It is symmetric and generally positive semidefinite.
From the spectral theorem there is an orthogonal matrix U such that

$$U^TA^TAU = \mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_r, 0, 0, \ldots, 0) \tag{4.202}$$

where the elements of $\mathrm{diag}(\lambda_1, \lambda_2, \ldots, \lambda_r, 0, 0, \ldots, 0)$ are the eigenvalues of $A^TA$. Hence,
$\lambda_i > 0$, $1 \le i \le r$, and these can be arranged in decreasing order $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r$. Define
$\mu_i = \sqrt{\lambda_i}$, $1 \le i \le r$, and let

$$D = \mathrm{diag}(\mu_1, \mu_2, \ldots, \mu_r, 0, 0, \ldots, 0). \tag{4.203}$$

Hence (4.202) can be written as

$$U^TA^TAU = D^2. \tag{4.204}$$

Letting $W = AU$, (4.204) becomes

$$W^TW = D^2 \quad (W \text{ is } n \times m). \tag{4.205}$$

Letting $w_j \in \mathbb{R}^n$, $1 \le j \le m$, be the j-th column of W, it then follows from (4.205) that

$$\langle w_j, w_j \rangle = \begin{cases} \mu_j^2, & 1 \le j \le r \\ 0, & r < j \le m \end{cases} \tag{4.206}$$

and $w_j = 0$ for $j > r$. From (4.205) it follows that $w_j$, $1 \le j \le r$, are orthogonal, hence
the first r columns of W are linearly independent so that $r \le n$.
Define

$$v_j = \frac{w_j}{\mu_j}, \quad 1 \le j \le r. \tag{4.207}$$

From (4.206) this is an orthonormal set in $\mathbb{R}^n$. Using the Gram-Schmidt process this can
be completed to an orthonormal basis $v_j$, $1 \le j \le n$, for $\mathbb{R}^n$. Let

$$V = [v_1 | v_2 | \cdots | v_n] \tag{4.208}$$

and observe that $VF = W$.
In fact, since the j-th column of F is $\mu_j e_j$ for $1 \le j \le r$ and zero otherwise,

$$VF = [v_1 | v_2 | \cdots | v_n] F = [\mu_1v_1 | \mu_2v_2 | \cdots | \mu_rv_r | 0 | 0 | \cdots | 0] = [w_1 | w_2 | \cdots | w_r | 0 | 0 | \cdots | 0] = W. \tag{4.209}$$

Thus,

$$VF = W = AU. \tag{4.210}$$

Since V is orthogonal, $V^{-1} = V^T$ so that $V^TAU = F$ as required. ■
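
The first step of the proof, $\mu_i = \sqrt{\lambda_i}$ with $\lambda_i$ the eigenvalues of $A^TA$, can be checked numerically. The sketch below is restricted to 2 x 2 matrices (an assumption made so that the eigenvalues of the symmetric matrix $A^TA$ have a closed form); the function name and test matrix are illustrative.

```python
import math


def singular_values_2x2(A):
    """Singular values of a 2 x 2 matrix, as square roots of the eigenvalues of A^T A."""
    (a, b), (c, d) = A
    # Entries of the symmetric matrix A^T A = [[p, q], [q, r]].
    p = a * a + c * c
    q = a * b + c * d
    r = b * b + d * d
    half_trace = (p + r) / 2
    disc = math.sqrt(((p - r) / 2) ** 2 + q * q)
    lam1 = half_trace + disc                 # larger eigenvalue
    lam2 = max(half_trace - disc, 0.0)       # smaller eigenvalue, clipped at zero
    return math.sqrt(lam1), math.sqrt(lam2)


A = [[1.0, 1.0], [0.0, 1.0]]
mu1, mu2 = singular_values_2x2(A)
print(mu1, mu2)
```

For this matrix $A^TA$ has eigenvalues $(3 \pm \sqrt{5})/2$, so the singular values are $(1 + \sqrt{5})/2$ and $(\sqrt{5} - 1)/2$; their product equals $|\det A| = 1$.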


Although we will return to this subject in Chapter 9 we make a few additional comments
concerning the SVD. For the regression models considered in this text $A = X^TX$
where X is a full rank n x m matrix with $n \ge m$ (usually $n \gg m$). In this case r = m
and F is of the form

$$F = \begin{pmatrix} \mu_1 & 0 & \cdots & 0 \\ 0 & \mu_2 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \mu_m \end{pmatrix} \tag{4.211}$$

and the columns of W are all nonzero. From $W = AU$, these span the column space of
A. In fact, from the definition of W,

$$Au_j = \mu_jv_j, \quad 1 \le j \le m \tag{4.212}$$

where $u_j$ is the j-th column of U. The vectors $u_j$, $1 \le j \le m$, are called the right singular
vectors of A and $v_j$, $1 \le j \le m$, are the left singular vectors of A. In fact, from (4.210),

$$U^TA^TAU = U^{-1}A^TAU = U^{-1}A^TW = F^2. \tag{4.213}$$

Thus, $A^TW = UF^2$ so that

$$A^Tv_j = \mu_ju_j, \quad 1 \le j \le m. \tag{4.214}$$

From (4.213) and (4.214), $\{u_j\}_{j=1}^m$ and $\{v_j\}_{j=1}^m$ play a role similar to that of the
eigenvectors of a symmetric matrix.
When A is an m x m invertible square matrix, then $\mu_i > 0$, $1 \le i \le m$, and the
ratio

$$\kappa = \frac{\mu_1}{\mu_m} \tag{4.215}$$

is called the condition number of A. Its importance derives from the following fact.
Suppose we wish to solve $Ax = b$. In practice b may be subject to various errors such
as round-off errors. Thus rather than solving $Ax = b$ we solve $A\hat{x} = b + \Delta b$ where $\Delta b$
is the error in b. The relative error in the computed solution, $\|x - \hat{x}\| / \|x\|$, is bounded
by $\kappa \|\Delta b\| / \|b\|$, giving [112]

$$\frac{\|x - \hat{x}\|}{\|x\|} \le \kappa \frac{\|\Delta b\|}{\|b\|}. \tag{4.216}$$

Thus $\kappa$ measures the degree to which errors are magnified by the solution process.
(From (4.215), $\kappa \ge 1$.) If $\kappa$ is large (this depends on the precision of one's computer
arithmetic and the required number of digits) we say that A is ill-conditioned; otherwise
it is well-conditioned. As we shall see in Chapter 9 the conditioning problem plays a
major role in evaluating the validity of a regression model.

It is of interest to determine what types of matrices are well-conditioned and particularly
those that are optimally conditioned with $\kappa = 1$. Since the singular values of
A are the square roots of the eigenvalues of $A^TA$, if $A = I_m$, then $A^TA = I_m^2 = I_m$ and $\lambda_i(I_m) = 1$,
$1 \le i \le m$. Hence, $\kappa = 1$. More generally, any m x m diagonal matrix with constant non-zero
diagonal elements has $\kappa = 1$.

If A is orthogonal, then $A^TA = I_m$ so again $\lambda_i(A^TA) = 1$, $1 \le i \le m$, and $\kappa = 1$.
Hence, orthogonal matrices are optimally conditioned and this suggests that for numerical
purposes one try to work with matrices that are orthogonal or close to it. The farther
a matrix is from being orthogonal, i.e., the farther its columns are from being perpendicular
vectors, the more poorly conditioned it will be. The closer the angles between the columns
are to zero, generally the larger the condition number. As we shall see, this geometric
condition translates into the statistical notion of correlation.
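
The bound (4.216) and the contrast between orthogonal and nearly collinear columns can be illustrated with a small sketch. It is restricted to 2 x 2 matrices, and the matrices, the perturbation and the function names are assumptions made for illustration only.

```python
import math


def singular_values_2x2(A):
    """Largest and smallest singular value of a 2 x 2 matrix."""
    (a, b), (c, d) = A
    p, q, r = a * a + c * c, a * b + c * d, b * b + d * d   # entries of A^T A
    lam1 = (p + r) / 2 + math.sqrt(((p - r) / 2) ** 2 + q * q)
    if lam1 == 0:
        return 0.0, 0.0
    det = a * d - b * c
    lam2 = det * det / lam1     # the eigenvalues of A^T A multiply to det(A)^2
    return math.sqrt(lam1), math.sqrt(lam2)


def condition_number(A):
    """kappa = mu_1 / mu_m as in (4.215)."""
    mu_max, mu_min = singular_values_2x2(A)
    if mu_min == 0:
        raise ValueError("matrix is singular")
    return mu_max / mu_min


def solve_2x2(A, b):
    """Solve a 2 x 2 system by Cramer's rule (adequate for this illustration)."""
    (a, bb), (c, d) = A
    det = a * d - bb * c
    return [(b[0] * d - bb * b[1]) / det, (a * b[1] - c * b[0]) / det]


def norm(v):
    return math.sqrt(sum(t * t for t in v))


theta = 0.3
Q = [[math.cos(theta), -math.sin(theta)], [math.sin(theta), math.cos(theta)]]
kappa_Q = condition_number(Q)            # orthogonal matrix

A = [[1.0, 1.0], [1.0, 1.0001]]          # columns almost parallel
kappa_A = condition_number(A)

b = [2.0, 2.0001]                        # exact solution is (1, 1)
db = [0.0, 0.0001]                       # small error in b
x = solve_2x2(A, b)
x_hat = solve_2x2(A, [b[0] + db[0], b[1] + db[1]])
rel_err = norm([x_hat[i] - x[i] for i in range(2)]) / norm(x)
bound = kappa_A * norm(db) / norm(b)
print(kappa_Q, kappa_A, rel_err, bound)
```

The rotation matrix has condition number 1, while the nearly collinear matrix has a condition number of roughly 40,000: a perturbation of about 0.004% in b changes the solution from (1, 1) to approximately (0, 2), a relative error of about 100%, which is still within the bound.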


4.9.1 The QR Decomposition 


As indicated above, numerical calculations with orthogonal matrices are important so we 
now consider one further decomposition of A in terms of an orthogonal matrix. 


Theorem 4.19 (QR decomposition) Let A be an n x m matrix ($n \ge m$) of rank
m. Then A can be factored as

$$A = QR \tag{4.217}$$

where Q is an n x m matrix with orthonormal columns and R is an m x m upper triangular
matrix.


The proof of (4.217) is a direct consequence of the Gram-Schmidt process. We will
illustrate this for a 3 x 3 matrix; the proof in general follows along the same lines.
Suppose

$$A = [a_1 | a_2 | a_3] \tag{4.218}$$

where $a_i$, $1 \le i \le 3$, are the columns of A. By the Gram-Schmidt process there is a set
of three orthonormal vectors $w_i$, $1 \le i \le 3$, such that

$$\begin{aligned} w_1 &= c_1a_1, \\ w_2 &= c_2a_1 + c_3a_2, \\ w_3 &= c_4a_1 + c_5a_2 + c_6a_3. \end{aligned} \tag{4.219}$$

Letting $w_i = (w_{1i}, w_{2i}, w_{3i})^T$ and $a_i = (a_{1i}, a_{2i}, a_{3i})^T$, writing out (4.219) explicitly shows
that

$$Q = [w_1 | w_2 | w_3] = \begin{pmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \end{pmatrix} \begin{pmatrix} c_1 & c_2 & c_4 \\ 0 & c_3 & c_5 \\ 0 & 0 & c_6 \end{pmatrix} = AS \tag{4.220}$$

where S is upper triangular. Since A and Q are invertible by assumption, so is S. Hence,

$$A = QS^{-1}. \tag{4.221}$$

Since the inverse of an upper triangular matrix is upper triangular, $S^{-1} = R$ is upper
triangular and (4.217) follows [112]. ■


The importance of the QR decomposition in regression analysis stems from the fact
that R can be produced in a numerically stable way by applying a sequence of orthogonal
transformations to A, and this can be used in calculating solutions to

$$X^TXx = y \tag{4.222}$$

by forming the QR decomposition of X. In fact, if $X = QR$ then $X^T = R^TQ^T$ so

$$X^TX = R^TQ^TQR. \tag{4.223}$$

Because the columns of Q are orthonormal, $Q^TQ = I_m$ so that

$$X^TX = R^TR. \tag{4.224}$$

By the triangularity of R, we see that (4.224) is just the Cholesky decomposition of $X^TX$
with $L = R^T$ in (4.192). For numerical stability, this is generally the preferred method of
obtaining the Cholesky factorization, and hence the solution of (4.222). Further applications
of these results will be given as we proceed.
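
The relation (4.224) can be checked numerically. The sketch below builds Q and R by the modified Gram-Schmidt process, which is used here only because it is short; it is not the sequence of orthogonal transformations referred to above and is less stable numerically. The design matrix `X` and the function name are illustrative assumptions.

```python
import math


def qr_gram_schmidt(X):
    """Return (Q, R) with X = Q R, using modified Gram-Schmidt on the columns of X."""
    n, m = len(X), len(X[0])
    cols = [[X[i][j] for i in range(n)] for j in range(m)]
    q_cols = []
    R = [[0.0] * m for _ in range(m)]
    for j, a in enumerate(cols):
        w = a[:]
        # Remove the components of column j along the earlier orthonormal vectors.
        for k, q in enumerate(q_cols):
            R[k][j] = sum(qi * wi for qi, wi in zip(q, w))
            w = [wi - R[k][j] * qi for wi, qi in zip(w, q)]
        length = math.sqrt(sum(wi * wi for wi in w))
        if length == 0:
            raise ValueError("columns are linearly dependent")
        R[j][j] = length
        q_cols.append([wi / length for wi in w])
    Q = [[q_cols[j][i] for j in range(m)] for i in range(n)]
    return Q, R


# Illustrative 3 x 2 design matrix: an intercept column and one regressor.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Q, R = qr_gram_schmidt(X)
XtX = [[sum(X[i][a] * X[i][b] for i in range(3)) for b in range(2)] for a in range(2)]
RtR = [[sum(R[k][a] * R[k][b] for k in range(2)) for b in range(2)] for a in range(2)]
print(R)
print(XtX, RtR)
```

Here $X^TX$ is [[3, 6], [6, 14]] and R has rows $(\sqrt{3}, 2\sqrt{3})$ and $(0, \sqrt{2})$, so $R^TR$ reproduces $X^TX$ up to round-off.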


4.10 Exercises 


1 
1 10 1 0 4 2 1 1 
4.1 For A= | | Ap B=| a тот х= 0 andy=|_i |, 
—1 
find 
(a) AB (b) A7B (c) Ay (d) Bx (e) y AT Ay (f) A?-2A +I? 
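4.2 Let $X = \begin{pmatrix} 1 & 2 & 4 \\ 0 & -1 & -2 \\ -1 & 0 & 3 \end{pmatrix}$ and $Y = \begin{pmatrix} 2 & 0 & 0 \\ -2 & 5 & 0 \\ 0 & -3 & 1 \end{pmatrix}$.

(a) Find $X^2$, $Y^2$, $XY$, $YX$.
(b) Show that $(X + Y)^2 = X^2 + XY + YX + Y^2$.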


4.3 Verify that 
1 6 
4.4 For А = Es 2 ‚ВвВ=/|2 8j. 
6 9 -7 3 4 


Find $A + B^T$ and $A^T + B$, and explain the relationship between the two sums.


4.5 Prove the properties (i)-(vi) of the trace of a matrix. 


4.6 Prove that if $A = \mathrm{diag}(a_i)$, $1 \le i \le n$, and $a_i \ne 0$, $1 \le i \le n$, then $A^{-1} = \mathrm{diag}(1/a_i)$, $1 \le i \le n$. In particular, $I_n^{-1} = I_n$.


4.7 Prove Theorem 4.6 (i)-(v). 


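4.8 Let $A = \begin{pmatrix} 1 & -1 & 0 \\ 0 & 4 & 1 \\ 3 & 0 & 2 \end{pmatrix}$, $B = \begin{pmatrix} 3 & 4 & 0 \\ 0 & -1 & 2 \\ 1 & 0 & 1 \end{pmatrix}$.

(a) Partition A and B as

$$A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}, \quad B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix}$$

where both $A_{11}$ and $B_{11}$ have the same dimension 2 x 2.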


(b) Find AB both with and without the partitioning, to verify the validity of 
multiplication of partitioned matrices. 


(c) Find AB? by showing that 


бе] BP]. 
Вз, B5; 


4.9 (a) Consider the vectors $b = (2, -1, 4, 0)^T$ and $d = (-1, 3, -2, 1)^T$. Show that
$(b^Td)^2 \le (b^Tb)(d^Td)$.
(b) Consider the vectors $b = (-3, 4)^T$ and $d = (2, 1)^T$. Show that
$(b^Td)^2 \le (b^TAb)(d^TA^{-1}d)$


where 


4.10 Find the inner product of $y - (x^Ty / x^Tx)\,x$ with itself.


4.11 Which of the following transformations $A : \mathbb{R}^3 \to \mathbb{R}^2$ are linear? Let $A(x, y, z)$ be
given by
(a) (z,y +1) (b) (2,2) (c) (0,0) 
(d) (zy, 0) (е) (2+ V2y) (f) (у,х+ v) 
4.12 Let $T : \mathbb{R}^2 \to \mathbb{R}^2$ be a linear transformation.


(a) If T (2,3) = (—1,2) and T (1,2) = (5,2), what is T (4, 7)? [Hint: Write (4,7) 


as a linear combination of (2,3) and (1,2).] 
(b) If T (vi) = vi and T (wi) = wi, what is T (2v — 3w)? 


(c) Suppose T ( : ) = ( Е ) and T ( ] ) = ( E i Find a matrix A such 


that T is the action of A on R?. 

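4.13 In a genetic study, the generation matrix, $A = \begin{pmatrix} 1 & 1/2 \\ 0 & 1/2 \end{pmatrix}$, is given by Kempthorne
(1957, p. 120) [68]. Given that $f^{(i)} = Af^{(i-1)}$ $(i \ge 1)$ show that

$$f^{(2)} = \begin{pmatrix} 1 & 3/4 \\ 0 & 1/4 \end{pmatrix} f^{(0)}, \quad f^{(3)} = \begin{pmatrix} 1 & 7/8 \\ 0 & 1/8 \end{pmatrix} f^{(0)}$$

and

$$f^{(n)} = A^nf^{(0)} = \begin{pmatrix} 1 & 1 - 2^{-n} \\ 0 & 2^{-n} \end{pmatrix} f^{(0)}$$

where $f^{(0)}$ is an arbitrary vector in $\mathbb{R}^2$.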


4.14 Let the row vectors $1_4^T = (1, 1, 1, 1)$, $1_3^T = (1, 1, 1)$ and $x^T = (3, 6, 8, -2)$. Verify

(a) $1_4^Tx = x^T1_4$,
(b) $1_41_3^T = J_{4 \times 3}$, having all elements unity.


4.15 A square matrix J, which is quite useful and a variant thereof is given by: 


$J_n = 1_n1_n^T$ with $J_n^2 = nJ_n$;


and 


Particularly, a centering matrix, C, is defined by

$$C_n = I_n - \frac{1}{n}J_n.$$

Verify that $C_n = C_n^T = C_n^2$, $C_n1_n = 0$ and $C_nJ_n = J_nC_n = 0$.


4.16 Let A be a symmetric matrix.
Prove that if $A = A^{-1}$, then $\det(A) = \pm 1$.


3 4 1 
(а) det (AB) = det (A) det (В). 
(b) $\det(A^{-1}) = [\det(A)]^{-1}$.


417 Let A= Е рае: i | Shaw (ist 
4.18 Let $S = \begin{pmatrix} a & b & b \\ b & a & b \\ b & b & a \end{pmatrix}$, where $b > 0$.
(a) Find the value of a such that $|S| = 0$.
(b) Find $S^{-1}$.
—2 3 —1 
4.19 Let A= 1 2 —1 
z J 1 


(a) Calculate the transpose of $A^{-1}$ and the inverse of $A^T$.
(b) Calculate the inverse of А 7. 


4.20 Let A be an n x n matrix such that $A^T = A^{-1}$.


a 


(a) Show that any matrix of the form | = 


| | where a? + b? = 1, has this 


property. 
(b) Show that det (A) = +1 for such a matrix. 


4.21 Prove the following.
(a) If $AB = I_n$, then $\det(A) \ne 0$ and $\det(B) \ne 0$.
(b) For A and B symmetric, $[(AB)^T]^{-1} = A^{-1}B^{-1}$.
(c) Let $P = X(X^TX)^{-1}X^T$; then P is symmetric and idempotent.


4.22 Find the characteristic polynomial, the eigenvalues, and their associated eigen- 
vectors for each of the following matrices. 


2 1 1 1 0 0 —2 10 0 
ofiso zajo 9-2 оез о 


4.23 Let A be an upper triangular matrix. Prove that all diagonal entries of A are
eigenvalues. (An eigenvalue is also called a proper value.) [Hint: The determinant of an upper
triangular matrix is the product of its diagonal elements.]


4.24 Prove that $\lambda$ is an eigenvalue of $A_{n \times n}$ if and only if $\det(A - \lambda I) = 0$, where I is
the n x n identity matrix.
4.25 Define an observation vector (or data vector) $x^T = (x_1, x_2, \ldots, x_n)$.
(a) Show that the mean $\bar{x} = \frac{1}{n}x^T1 = \frac{1}{n}1^Tx$.
(b) Find the matrix C such that $Cx = x - \bar{x}1$ is the deviation vector.
(c) Show that $\sum_{i=1}^n (x_i - \bar{x})^2 = x^Tx - n\bar{x}^2 = x^TCx$.


4.26 Let $X_1, X_2, \ldots, X_n$ be a random sample with mean 0 and variance $\sigma^2$. Find the
expected value of the quadratic form


Q-(Xi eo p зум] ecce X сч S x). 
4.27 Find matrices A such that the following quadratic forms are given by the product
$x^TAx$.


(a) 222 — бту — y?. 

(b) Az? + 34122 — 222. 

(с) 22 — iy? + 422. 

(d) 222 — 25115 + 22 + 42123 — 322. 


4.28 Let A= | T : | be a 2 x 2 symmetric matrix. Prove that A is positive definite 


if and only if det (A) > 0 and a> 0. 


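4.29 Classify the following matrices as positive definite or positive semidefinite:

(a) $\begin{pmatrix} 4 & 1 & 2 \\ 1 & 4 & -1 \\ 2 & -1 & 4 \end{pmatrix}$ (b) $\begin{pmatrix} 1 & 0 & -1 \\ 0 & 1 & 0 \\ -1 & 0 & 1 \end{pmatrix}$ (c) $\begin{pmatrix} 2 & 1 & -1 \\ 1 & 2 & 1 \\ -1 & 1 & 2 \end{pmatrix}$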
4.30 Which of the following matrices are diagonalizable? 
1 1 -2 1 2 3 
(a) | К: | (b) i EJ ()| 4 0 4 (d|0 -1 2 
1-1 4 0 0 2 


4.31 Let A be a symmetric matrix such that $x^TAx$ yields the given quadratic form
$x^2 + 4xy + y^2$.
(a) Find the eigenvalues and eigenvectors of A.
(b) Find an orthogonal matrix P such that $P^TAP$ is a diagonal matrix.
4.32 Let x = (2,1, -3), y = (—5,1,2), and z = (1, —4,2). If possible, compute 
(а) xy? (Ы) 7а (се) (ухТ)л  (d)3dy! (е) ухГухТ 


4.33 Consider the following linear system: 


22, +322 —3r3 +24 +25 E 

321 +2273 +825 = —2 

224 +322 —434 =g 
T3 +24 d ds = 5. 


(a) Find the coefficient matrix. 
(b) Write the linear system in matrix form. 


(c) Find the augmented matrix. 


4.34 Using the Gaussian elimination method, solve the following sets of simultaneous
equations.
Te, т + 22 + 223 = —1 —%1 + 242 — 23 = —2 
(a) | дт bi a 1 (b) £1 — 229 + 23 = —5 (с) 421 = 229 + 273 md 
imm 341 + T2 + 23 = 3. 32, — 473 = —1. 
4.35 Write each of the following as a linear system in matrix form. 


9)a(73)*($)-(1). 


2 1 0 4 
(b) т, 1 + |2 | +2з | -3 |=| 3 
—3 1 2 2 


4.36 Let xı = Oxy хо = (4, “кй, хз = (1799) belong to the solution 
space of Ax = 0 where A is a nonzero 3 x 3 matrix. Is (x1,x2,x3) linearly 
independent? 


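4.37 Decide whether or not the following vectors are linearly independent, by solving
$c_1v_1 + c_2v_2 + c_3v_3 = 0$:

(a) $v_1 = \begin{pmatrix} 1 \\ 2 \\ -1 \end{pmatrix}$, $v_2 = \begin{pmatrix} 2 \\ 2 \\ 0 \end{pmatrix}$, $v_3 = \begin{pmatrix} 1 \\ 4 \\ 3 \end{pmatrix}$

(b) $v_1 = \begin{pmatrix} 1 \\ 2 \\ 5 \end{pmatrix}$, $v_2 = \begin{pmatrix} 2 \\ -2 \\ 4 \end{pmatrix}$, $v_3 = \begin{pmatrix} 1 \\ 1 \\ 4 \end{pmatrix}$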
4.38 Let 


ieee a | /2 V2 
Т = н у T 1 —1 : m _— „ыш 
{ (59255) ” (1,0, ,0), у 9 ‚0, 2 ‚0 


(a) Identify tT, и? and У? are orthogonal or orthonormal. 
(b) Find the angles of each pair of vector. 
(c) Compute the length of each vector. 
4.39 Let $x^T = [1, 3, 2]$, $y^T = [2, 4, 5]$.
(a) Find the vector $\hat{y} = bx$ where $b = \langle x, y \rangle / \|x\|^2$.
(b) Find $\hat{y} - y$ and show that $(\hat{y} - y) \perp x$.

4.40 Draw a picture and prove the parallelogram law:
$$|\mathbf{x} + \mathbf{y}|^2 + |\mathbf{x} - \mathbf{y}|^2 = 2\left(|\mathbf{x}|^2 + |\mathbf{y}|^2\right).$$


4.41 Let $\mathbf{x}_1 = [1, 1, 0]^T$, $\mathbf{x}_2 = [1, 0, 1]^T$ and $\mathbf{x}_3 = [0, 1, 1]^T$. Find the orthonormal
vectors $\mathbf{q}_1$, $\mathbf{q}_2$ and $\mathbf{q}_3$.


4.42 Find an orthonormal basis for the subspace of $\mathbb{R}^3$ consisting of all vectors of the
form

(a) $(a, a + b, b)$ (b) $(a, b, c)$ such that $a + b + c = 0$.


4.43 Consider the Euclidean space R^ and let W be the subspace that has S = 


{ |1, 1, 21,0], [0, 2,0, 1} as a basis. Use the Gram-Schmidt process to obtain an 
orthonormal basis for W. 
4.44 Find the factorization $A = LDL^T$ where $D$ is a diagonal matrix, and then the
two Cholesky factors in $\left(LD^{1/2}\right)\left(LD^{1/2}\right)^T$ for

$$A = \begin{pmatrix} 2 & 6 \\ 6 & 21 \end{pmatrix}.$$
4.45 Find the QR decomposition of 
12 2: ч] 1 2 —1- 2 0 
(а) | E ] (b) | —1 3 (с) | -1 -2 (d) 1 0 2 
0 1 1 1 —1 —2 2 


4.46 Use the Gram-Schmidt process to 


ll 
gn 
pem 
үү 
US 


u = (0,0,1)’, v=(0,1,1)’, w 


and write the result in the form A = QR. 


4.47 With the same matrix A, and with b = (1,1, D. use A = QR to solve the least 
squares problem Ax = b. 


4.48 Find the singular value decomposition (SVD) of the matrices (b) and (c) in 
Exercise 4.45. 


4.49 Let $\mathbf{X}_{3\times 1}$ be $N_3(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ with

(a) Find the marginal distribution of $X_1$.
(b) Find the joint distribution of $X_2$ and $X_3$.
(c) Find the distribution of $Z = 2X_1 - X_2 + 3X_3$.
(d) Find the conditional distribution of $X_1$, given that $X_2 = x_2$.
(e) Find the conditional distribution of $(X_1, X_2)$ given that $X_3 = x_3$.
(f) Are $X_1$ and $X_3$ independent? Why?
(g) Find a $2 \times 1$ vector $\mathbf{a}$ such that $X_2$ and $X_2 - \mathbf{a}^T \begin{pmatrix} X_1 \\ X_3 \end{pmatrix}$ are independent.
(h) Find the covariance matrix of $(X_1, X_2)$ given that $X_3 = x_3$. Using this find the
correlation of $(X_1, X_2)$ given that $X_3 = x_3$, $\rho_{12\cdot 3}$.

4.50 Using X given in Ex. 4.49, find the covariance of Z4, and Z2, where 


Zı = Ху —2X24+3X3—-6 
Z9 —2X144 3. 


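4.51 Suppose $\mathbf{x}_{p\times 1} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$. Find the moment generating function of $\mathbf{y}$, where
$\mathbf{y} = \mathbf{x} - \boldsymbol{\mu}$.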
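4.52 Suppose $\mathbf{x}_{p\times 1} \sim N(\boldsymbol{\mu}, \boldsymbol{\Sigma})$, where $\boldsymbol{\Sigma}$ is positive definite, and the pdf is given by

$$f(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) = (2\pi)^{-p/2}\, |\boldsymbol{\Sigma}|^{-1/2} \exp\left(-\tfrac{1}{2} Q\right)$$

where $Q = (\mathbf{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu})$.

(a) Show that the mean vector $\boldsymbol{\mu}$ is the solution for $\mathbf{x}$ to $\partial f(\mathbf{x}; \boldsymbol{\mu}, \boldsymbol{\Sigma}) / \partial \mathbf{x} = \mathbf{0}$.

(b) Find the solution for $\mathbf{x}$ to $\partial Q / \partial \mathbf{x} = \mathbf{0}$. Is this equivalent to the one you obtained
in (a)?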


4.53 Suppose a random vector хох is distributed N (и, ©) and the pdf is given by 
f ба E) = (т) I" exp (-50) 


where Q = 322 + 222 — 4512 +145 — 812 +10. (Note: This is called the bivariate 
normal distribution). 


(a) Find и and X^. 
(b) Find E (Xı|X2 = 21) . 


Chapter 5 


Multiple Regression 


5.1 Introduction 


In this chapter we extend the ideas developed in Chapter 3 to fit models of data when 
there is more than one independent variable. As we shall see, the use of matrix algebra 
greatly simplifies many of the calculations and will be used consistently throughout the 
chapter. 


5.2 The General Linear Model 


In Chapter 3 we studied the problem of using a single independent variable $x$ to aid in
estimating the mean value $E(Y_x)$ of a family of random variables $Y_x$. As we pointed
out in Example 3.21, a single variable may be inadequate to explain the variation in
some variable $Y$. In this chapter we consider the problem of regressing $Y$ on one or
more variables as a generalization of the techniques developed in Chapter 3. This topic
is usually called multiple regression and the resulting models multiple regression models
or general linear models.

As a specific situation, we might expect that the income a person earns depends
on a number of factors such as the number of years of education, the type of job, age,
sex (kind; not how much) and no doubt others. Likewise, the price of a house generally
depends on its size, location, configuration and number of rooms, age, lot size and others.
Many other examples of such phenomena will be given as we proceed.

Suppose now that we have $m \ge 1$ independent variables $x_1, x_2, \ldots, x_m$, denoted in
vector form as $(x_1, x_2, \ldots, x_m) = \mathbf{x}$. Let $Y_{\mathbf{x}}$ be a random variable whose mean $E(Y_{\mathbf{x}}) = \mu_{\mathbf{x}}$
depends in some fashion on $\mathbf{x}$. Generalizing (3.1) we will assume for now that

$$Y_{\mathbf{x}} = \beta_0 + \sum_{j=1}^{m} \beta_j x_j + \varepsilon_{\mathbf{x}} \tag{5.1}$$

where $\beta_j$, $j = 0, 1, \ldots, m$, are fixed, but generally unknown, real numbers, and $\varepsilon_{\mathbf{x}}$ is an
error random variable satisfying $E(\varepsilon_{\mathbf{x}}) = 0$. Thus,

$$\mu_{\mathbf{x}} = E(Y_{\mathbf{x}}) = \beta_0 + \sum_{j=1}^{m} \beta_j x_j, \tag{5.2}$$

so that we are assuming that the dependence of $\mu_{\mathbf{x}}$ on the regression coefficients
$(\beta_0, \beta_1, \ldots, \beta_m)$ is linear. The model given by Eq. (5.1) is usually referred to as the general linear
model.

As for the simple linear regression model, we refer to $\beta_j$, $0 \le j \le m$, as the regression
coefficients and to the $x_j$'s as the independent variables (variables for short) or regressors. $\beta_0$ is
usually called the intercept and $Y_{\mathbf{x}}$ the dependent or observed (response) variable. When
a particular variable $x_i$ is quantitative, that is, it assumes values in an interval of the real
line, then $\beta_i$ is sometimes called the $i$-th slope, since it represents the rate of change of
the mean of $Y_{\mathbf{x}}$ in the direction of $x_i$. That is, $\beta_i$ is the amount by which the mean of $Y_{\mathbf{x}}$ changes when $x_i$ increases by
one unit and all the other variables are held fixed. (In calculus terms $\beta_i = \partial \mu_{\mathbf{x}} / \partial x_i$, the
partial derivative of $\mu_{\mathbf{x}}$ with respect to $x_i$.) When $x_i$ is qualitative, that is, it can assume
only a discrete set of values, often only 0 and 1, then no such calculus interpretation of $\beta_i$
is usually possible. Some authors refer to $\beta_j$, $0 \le j \le m$, as partial regression coefficients,
but we will not use this terminology here.

For the remainder of this chapter we shall assume that the $x_j$'s are deterministic, and
can be measured without error. Similarly, it is assumed that the values of $Y_{\mathbf{x}}$ can be
measured without error as well.

To illustrate the scope of the general linear model in practice we now present a number 
of general and specific models that have appeared in the literature. The reader will see 
that the general linear model encompasses a wide variety of statistical possibilities. 


Example 5.1 In Example 3.4 we considered using a simple linear regression model
to explain the delivery time for drinks to a customer as a function of the number of cases
delivered. A further examination of the data suggested that the distance the delivery
man had to walk could be of some significance. To investigate this possibility a linear
model

$$Y_{\mathbf{x}} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon_{\mathbf{x}} \tag{5.3}$$

was considered. Here, as in Example 3.4, $Y_{\mathbf{x}}$ is the delivery time in minutes, $x_1$ is the
number of cases delivered and $x_2$ is the distance walked, in feet.


Example 5.2 The amount of water used in a production plant is quite large. In
order to understand the factors determining the amount of monthly water usage, the cost
control engineer considered using a linear regression model of the form

$$Y_{\mathbf{x}} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon_{\mathbf{x}} \tag{5.4}$$

where

$Y$ = monthly water usage in gallons,
$x_1$ = average monthly temperature,
$x_2$ = amount of production (pounds),
$x_3$ = number of operating days in a month,
$x_4$ = number of persons on the plant payroll.


Example 5.3 In exercise physiology an objective measure of aerobic fitness is the
oxygen consumption in volume per unit body weight per unit time ($Y$). To determine
if it was possible to predict this quantity, an experiment was conducted and a linear
regression model

$$Y_{\mathbf{x}} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \varepsilon_{\mathbf{x}} \tag{5.5}$$

was proposed. Here the variables

$x_1$ = age in years,
$x_2$ = weight in kilograms,
$x_3$ = time to run 1.5 miles,
$x_4$ = resting pulse rate,
$x_5$ = pulse rate at the end of the run,
$x_6$ = maximum pulse rate during the run

were used to attempt to explain the results of the experiment.


Example 5.4 (Polynomial regression models) In the case of the simple linear regression
model we pictured the data $(x_i, y_i)$ as being roughly scattered about a straight
line. In this case the model is linear in both the parameters $(\beta_0, \beta_1)$ and the independent
variable $x$. In the general linear model, the linearity property refers only to the
linear behavior in the coefficients $(\beta_0, \beta_1, \ldots, \beta_m)$. The model need not be linear in the
independent variables.

To clarify this, suppose $x_j = x^j$ where $x$ is some explanatory variable. Then consider
the model

$$Y_x = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_m x^m + \varepsilon_x. \tag{5.6}$$

In this case, the model has a polynomial behavior in $x$, but is still linear in $(\beta_0, \beta_1, \ldots, \beta_m)$.
Such a model is called a polynomial model of degree $m$.

One can also use the general linear model to model polynomial behavior in two or
more variables. For example, a two variable quadratic model would be of the form

$$Y_{\mathbf{x}} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1^2 + \beta_4 x_1 x_2 + \beta_5 x_2^2 + \varepsilon_{\mathbf{x}}. \tag{5.7}$$

We will discuss these models in greater detail in Chapter 7.
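
The point that a polynomial model is still linear in the coefficients can be made concrete with a short sketch. The Python below is only an illustration: the function names and the numerical coefficients are hypothetical and do not come from the text. It builds one row $(1, x, x^2, \ldots, x^m)$ of the design matrix for model (5.6) and shows that the mean response is an ordinary dot product with $\beta$.

```python
def polynomial_design_row(x, degree):
    """Return [1, x, x**2, ..., x**degree], the regressors x_j = x**j."""
    return [x ** j for j in range(degree + 1)]


def mean_response(beta, x):
    """Mean of Y at x under the polynomial model (5.6): sum_j beta[j] * x**j."""
    row = polynomial_design_row(x, len(beta) - 1)
    return sum(b * r for b, r in zip(beta, row))


# Hypothetical coefficients for a cubic (m = 3); not taken from the text.
beta = [1.0, -2.0, 0.5, 0.25]

row = polynomial_design_row(2.0, 3)  # regressors evaluated at x = 2
mu = mean_response(beta, 2.0)

# Linearity in the coefficients: doubling beta doubles the mean,
# even though the mean is a cubic function of x.
mu_doubled = mean_response([2 * b for b in beta], 2.0)
```

Because each power of $x$ is treated as its own regressor, any method that fits the general linear model can fit a polynomial without modification.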


One of the most important aspects of the general linear model is its ability to describe 
both quantitative and qualitative behavior simultaneously. 


Example 5.5 Suppose that it is felt that the salary that a person earns depends in 
a linear fashion on the number of years worked but that the rate of growth depends on 
a person’s sex. (Such models can be useful in studying employment discrimination.) To 
test this possibility data were gathered and it was planned to fit a model. How should 
one proceed? 

Of course one possibility is to fit two straight lines, one for each sex, and be done. 
But in a study of this type one of the things that we might be interested in is the ability 
to determine whether sex does have an influence on the rate of change of salary. Thus 
we will want to determine whether the salary models for males and females differ in 
their slopes. Although there are a number of ways of doing this, it is perhaps most 
convenient (and efficient) to have one model which will fit the male and female salary 
data simultaneously. This can be done using the general linear model by introducing a 
dummy variable $x_2$ where

$$x_2 = \begin{cases} 0, & \text{if person is female,} \\ 1, & \text{if person is male.} \end{cases} \tag{5.8}$$
Letting $x_1$ = years worked, the model

$$Y_{\mathbf{x}} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon_{\mathbf{x}} \tag{5.9}$$

can be shown to model male and female salaries simultaneously. To see this, let $x_2 = 0$,
then

$$Y_{\mathbf{x}} = \beta_0 + \beta_1 x_1 + \varepsilon_{\mathbf{x}}$$

which represents the model for female salaries. If $x_2 = 1$, then

$$\begin{aligned} Y_{\mathbf{x}} &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2 + \varepsilon_{\mathbf{x}} \\ &= (\beta_0 + \beta_2) + (\beta_1 + \beta_3)\, x_1 + \varepsilon_{\mathbf{x}} \\ &= \beta_0^{*} + \beta_1^{*} x_1 + \varepsilon_{\mathbf{x}} \end{aligned} \tag{5.10}$$

which represents the model for male salaries. Note that in this representation $\beta_2$ represents
the difference between male and female starting salaries and $\beta_3$ represents the
difference in annual rate increases. In this form it is easy to test for significant differences
in these quantities. We shall take up such models in greater detail in Chapter 7. In the
statistics literature these are sometimes called analysis of covariance models [88]. As we
also shall see in Chapter 7, it is possible to consider models having only dummy variables,
called analysis of variance models. These models historically have been treated as
separate statistical models with their own terminology, methodology etc. [27]. However,
increasingly one finds that such models are being analyzed as particular examples of the
GLM [27, 88], which appears, at least to these writers, to remove a great deal of mystery
and complication from the subject.
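
The way the single model (5.9) contains both salary lines can be checked numerically. The following Python sketch uses made-up coefficient values (they are assumptions for illustration, not estimates from any data set) and a hypothetical helper `mean_salary`; it recovers the two intercepts and the two slopes by evaluating the mean response at $x_1 = 0$ and $x_1 = 1$.

```python
def mean_salary(beta, years, is_male):
    """Mean response under model (5.9) with the dummy variable of (5.8)."""
    b0, b1, b2, b3 = beta
    x1 = years
    x2 = 1 if is_male else 0
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2


# Hypothetical coefficients (beta_0, beta_1, beta_2, beta_3).
beta = (30.0, 2.0, 5.0, 0.5)

female_start = mean_salary(beta, 0, False)                   # beta_0
male_start = mean_salary(beta, 0, True)                      # beta_0 + beta_2
female_slope = mean_salary(beta, 1, False) - female_start    # beta_1
male_slope = mean_salary(beta, 1, True) - male_start         # beta_1 + beta_3

# beta_2 is the gap in starting salary, beta_3 the gap in yearly increase.
start_gap = male_start - female_start
slope_gap = male_slope - female_slope
```

With these values `start_gap` equals $\beta_2$ and `slope_gap` equals $\beta_3$, which is why a test of $\beta_3 = 0$ is a test of equal slopes.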

Although the linearity assumption in (5.1) is theoretically restrictive, the proper
choice of $x_i$, $1 \le i \le m$, and $Y_{\mathbf{x}}$ often allows one to represent many physical phenomena
quite adequately over appropriate ranges of the independent variables. Again we will
deal with this form of the model until Chapter 7.


5.3 Least Squares Estimation


5.3.1 Estimating $\boldsymbol{\beta}$


As for the case of simple linear regression, the first problem we deal with is that of
estimating the regression coefficients $(\beta_0, \beta_1, \ldots, \beta_m)$ and the residual variability of $\varepsilon_{\mathbf{x}}$.
To do this we assume that $n$, not necessarily distinct, measurements have been taken of
the independent variables $\mathbf{x} = (x_1, x_2, \ldots, x_m)$. Let $x_{ij}$ be the value of the $j$-th variable for
the $i$-th measurement. Then, letting $\mathbf{x}_i = (x_{i1}, x_{i2}, \ldots, x_{im})$,

$$Y_i = \beta_0 + \sum_{j=1}^{m} x_{ij} \beta_j + \varepsilon_i, \qquad 1 \le i \le n, \tag{5.11}$$

where $\varepsilon_i = \varepsilon_{\mathbf{x}_i}$.

The observed values of $Y_i$ at $\mathbf{x}_i$ will be denoted as usual by the lower case value
$y_i$. Our basic problem is the estimation of $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_m)^T$ from the pairs of data
$(\mathbf{x}_i, y_i)$, $1 \le i \le n$. For the remainder of this chapter we shall assume that $n > m + 1$
so that we have at least one more measurement than the number of parameters $(\beta_j)_{j=0}^{m}$.

Figure 5.1: A Linear Regression Surface with m = 2

Geometrically, if we think of representing $(\mathbf{x}_i, y_i)$ as a point in $\mathbb{R}^{m+1}$, then in analogy
to simple linear regression, we may think of the estimation problem as one of fitting an
$m$-dimensional hyperplane in $\mathbb{R}^{m+1}$ to this scatter of points. An example of this situation for
$m = 2$ is shown in Figure 5.1.

Before proceeding with the actual estimation details it is convenient to rewrite the 
model (5.1) in vector-matrix form. This notation, coupled with the techniques of linear 
algebra surveyed in Chapter 4, provides a powerful method for developing the properties 
of the general linear model. 

As in Chapter 4 (and so far in this chapter) vectors and matrices will be denoted by 
boldface letters, and if necessary their dimensions will be denoted as in Eq. (5.16). 

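Let $\mathbf{Y} = (Y_1, Y_2, \ldots, Y_n)^T$ denote the vector of random observations and their actual
values by

$$\mathbf{y} = (y_1, y_2, \ldots, y_n)^T. \tag{5.12}$$

The error vector $\boldsymbol{\varepsilon}$ is given by

$$\boldsymbol{\varepsilon} = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^T \tag{5.13}$$

and the vector of regression coefficients is denoted by

$$\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_m)^T. \tag{5.14}$$

Let

$$\mathbf{X} = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1m} \\ 1 & x_{21} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{pmatrix}_{n \times (m+1)} \tag{5.15}$$

where $\mathbf{X}$ is called the design matrix. (This terminology is used even if the values of $\mathbf{x}_i$ do
not result from a pre-planned experiment [104, 45].) With these definitions (5.11) takes
the form

$$\underset{n \times 1}{\mathbf{Y}} = \underset{n \times (m+1)}{\mathbf{X}}\ \underset{(m+1) \times 1}{\boldsymbol{\beta}} + \underset{n \times 1}{\boldsymbol{\varepsilon}}. \tag{5.16}$$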
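For now we make the same assumptions on $\boldsymbol{\varepsilon}$ as those given in Section 3.2. That
is, $\varepsilon_i$, $1 \le i \le n$, are assumed to be independent, and each $\varepsilon_i$ is $N(0, \sigma^2)$. Thus, the
vector $\mathbf{Y}$ has a multivariate normal $N(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I})$ distribution. These assumptions, as
in Chapter 3, lead naturally to choosing maximum likelihood estimation as a way of
estimating $\boldsymbol{\beta}$, and then to least squares estimation for more general error models.

When the errors are independent $N(0, \sigma^2)$, the likelihood function of $\mathbf{y}$ is given by

$$f_{\mathbf{Y}}(\mathbf{y}) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{1}{2\sigma^2} \Big( y_i - \beta_0 - \sum_{j=1}^{m} x_{ij}\beta_j \Big)^2 \right\}. \tag{5.17}$$

Writing $\mu_i = \beta_0 + \sum_{j=1}^{m} x_{ij}\beta_j$, (5.17) takes the form

$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{(\sigma\sqrt{2\pi})^n} \exp\left\{ -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu_i)^2 \right\} = \frac{1}{(\sigma\sqrt{2\pi})^n} \exp\left\{ -\frac{1}{2\sigma^2} (\mathbf{y} - \boldsymbol{\mu}, \mathbf{y} - \boldsymbol{\mu}) \right\} \tag{5.18}$$

where $\boldsymbol{\mu} = (\mu_1, \mu_2, \ldots, \mu_n)^T$ and $(\mathbf{y} - \boldsymbol{\mu}, \mathbf{y} - \boldsymbol{\mu})$ is the dot product of the vector $\mathbf{y} - \boldsymbol{\mu}$
with itself. Since $\boldsymbol{\mu} = E(\mathbf{Y}) = \mathbf{X}\boldsymbol{\beta}$, (5.18) becomes

$$f_{\mathbf{Y}}(\mathbf{y}) = \frac{1}{(\sigma\sqrt{2\pi})^n} \exp\left\{ -\frac{1}{2\sigma^2} (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}, \mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \right\}. \tag{5.19}$$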

From (5.19) we see that maximizing $f_{\mathbf{Y}}(\mathbf{y})$ with respect to $\boldsymbol{\beta}$ is the same as minimizing
the sum of squares of the residuals

$$g(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}, \mathbf{y} - \mathbf{X}\boldsymbol{\beta}). \tag{5.20}$$

The usual approach to doing this minimization is to differentiate $g(\boldsymbol{\beta})$ with respect
to $\beta_i$, $0 \le i \le m$, and solve the resulting equations obtained by setting these derivatives
to zero. This is the approach taken in Chapter 3 for the simple linear regression model
(but, see the derivation in Section 3.2). However, one is still required to verify that
the estimates so obtained actually do provide the minimum and this requires additional
calculus. Due to the quadratic behavior of $g(\boldsymbol{\beta})$ as a function of $\boldsymbol{\beta}$, a purely algebraic
approach to the minimization is possible. This is the route we follow. The calculus
approach will be outlined in the Exercises.


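Theorem 5.1 Consider the linear model given by Equation (5.11). If the errors are
independent $N(0, \sigma^2)$ random variables then $\hat{\boldsymbol{\beta}}$ is a maximum likelihood estimate of $\boldsymbol{\beta}$
if and only if $\hat{\boldsymbol{\beta}}$ is a solution to the set of linear equations

$$\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}. \tag{5.21}$$

If $\mathbf{X}^T\mathbf{X}$ is nonsingular, then (5.21) has a unique solution which is given by

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}. \tag{5.22a}$$

In this case the estimator $\hat{\boldsymbol{\beta}}$ is given by

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}. \tag{5.22b}$$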

(Note that we use the same notation $\hat{\boldsymbol{\beta}}$ for the estimator and its values. This is standard
practice, and should, we hope, cause no confusion.)


Note: In many problems, particularly those of analysis of variance type, (5.22b) can
lead to singular matrices $\mathbf{X}^T\mathbf{X}$. Thus the MLE of $\boldsymbol{\beta}$ need not be unique, so that the
assumption of nonsingularity of $\mathbf{X}^T\mathbf{X}$ is not superfluous.
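
As an illustration of Theorem 5.1, the sketch below forms the normal equations (5.21) for a small made-up data set with $m = 2$ and solves them exactly. It is a minimal sketch under stated assumptions: the helper names `normal_equations` and `solve` are hypothetical, the data were generated from $y = 1 + x_1 + 2x_2$ with no error so that the answer is known in advance, and exact rational arithmetic from Python's standard library stands in for the numerical methods a statistical package would use.

```python
from fractions import Fraction


def normal_equations(X, y):
    """Return (X^T X, X^T y) for a design matrix X given as a list of rows."""
    p = len(X[0])
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]
    xty = [sum(row[a] * yi for row, yi in zip(X, y)) for a in range(p)]
    return xtx, xty


def solve(A, b):
    """Solve A z = b by Gauss-Jordan elimination in exact rational arithmetic.

    Raises ValueError when A is singular, i.e. when (5.21) has no unique solution.
    """
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs)] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("X^T X is singular")
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [v / M[col][col] for v in M[col]]
        for r in range(n):
            if r != col:
                factor = M[r][col]
                M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


# Made-up data: each row of X is (1, x_i1, x_i2); y = 1 + x_1 + 2 x_2 exactly.
X = [[1, 0, 1], [1, 1, 0], [1, 2, 2], [1, 3, 1], [1, 4, 3]]
y = [3, 2, 7, 6, 11]

xtx, xty = normal_equations(X, y)
beta_hat = solve(xtx, xty)  # components equal 1, 1 and 2
```

Since the data contain no error, the solution of (5.21) reproduces the generating coefficients; with noisy data the same two steps give the least squares estimate.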


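We will show that the proof of Theorem 5.1 is a straightforward consequence of the
identity

$$g(\boldsymbol{\beta}) - g(\hat{\boldsymbol{\beta}}) = \big(\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}), \mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})\big), \tag{5.23}$$

where $\hat{\boldsymbol{\beta}}$ satisfies (5.21), and the fact that (5.21) always has at least one solution.
To establish (5.23) consider

$$\begin{aligned} g(\boldsymbol{\beta}) &= (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}, \mathbf{y} - \mathbf{X}\boldsymbol{\beta}) \\ &= (\mathbf{y}, \mathbf{y}) - 2(\mathbf{y}, \mathbf{X}\boldsymbol{\beta}) + (\mathbf{X}\boldsymbol{\beta}, \mathbf{X}\boldsymbol{\beta}) \\ &= (\mathbf{y}, \mathbf{y}) - 2(\mathbf{X}^T\mathbf{y}, \boldsymbol{\beta}) + (\mathbf{X}\boldsymbol{\beta}, \mathbf{X}\boldsymbol{\beta}). \end{aligned} \tag{5.24}$$

Using $\mathbf{X}^T\mathbf{y} = \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}$ in the middle term of (5.24) gives

$$\begin{aligned} g(\boldsymbol{\beta}) &= (\mathbf{y}, \mathbf{y}) - 2(\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}, \boldsymbol{\beta}) + (\mathbf{X}\boldsymbol{\beta}, \mathbf{X}\boldsymbol{\beta}) \\ &= (\mathbf{y}, \mathbf{y}) - 2(\mathbf{X}\hat{\boldsymbol{\beta}}, \mathbf{X}\boldsymbol{\beta}) + (\mathbf{X}\boldsymbol{\beta}, \mathbf{X}\boldsymbol{\beta}). \end{aligned} \tag{5.25}$$

Since (5.25) holds for all vectors $\boldsymbol{\beta}$, it certainly holds for $\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}$. Thus

$$\begin{aligned} g(\hat{\boldsymbol{\beta}}) &= (\mathbf{y}, \mathbf{y}) - 2(\mathbf{X}\hat{\boldsymbol{\beta}}, \mathbf{X}\hat{\boldsymbol{\beta}}) + (\mathbf{X}\hat{\boldsymbol{\beta}}, \mathbf{X}\hat{\boldsymbol{\beta}}) \\ &= (\mathbf{y}, \mathbf{y}) - (\mathbf{X}\hat{\boldsymbol{\beta}}, \mathbf{X}\hat{\boldsymbol{\beta}}) \end{aligned} \tag{5.26}$$

and

$$\begin{aligned} g(\boldsymbol{\beta}) - g(\hat{\boldsymbol{\beta}}) &= (\mathbf{X}\boldsymbol{\beta}, \mathbf{X}\boldsymbol{\beta}) - 2(\mathbf{X}\hat{\boldsymbol{\beta}}, \mathbf{X}\boldsymbol{\beta}) + (\mathbf{X}\hat{\boldsymbol{\beta}}, \mathbf{X}\hat{\boldsymbol{\beta}}) \\ &= (\mathbf{X}\boldsymbol{\beta} - \mathbf{X}\hat{\boldsymbol{\beta}}, \mathbf{X}\boldsymbol{\beta} - \mathbf{X}\hat{\boldsymbol{\beta}}) \\ &= \big(\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}), \mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})\big). \end{aligned} \tag{5.27}$$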

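Proof of Theorem 5.1. We first show that any solution $\hat{\boldsymbol{\beta}}$ to (5.21) minimizes $g(\boldsymbol{\beta})$.
To see this, observe that

$$g(\boldsymbol{\beta}) - g(\hat{\boldsymbol{\beta}}) = \big(\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}}), \mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})\big) \ge 0, \tag{5.28}$$

so that $g(\boldsymbol{\beta}) \ge g(\hat{\boldsymbol{\beta}})$ for all vectors $\boldsymbol{\beta}$. Thus $\hat{\boldsymbol{\beta}}$ minimizes $g(\boldsymbol{\beta})$ and so is an MLE of $\boldsymbol{\beta}$.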
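On the other hand, suppose that $\hat{\boldsymbol{\beta}}_1$ minimizes $g(\boldsymbol{\beta})$. Then $g(\boldsymbol{\beta}) \ge g(\hat{\boldsymbol{\beta}}_1)$ for all $\boldsymbol{\beta}$, so in
particular $g(\hat{\boldsymbol{\beta}}) \ge g(\hat{\boldsymbol{\beta}}_1)$. Now using the identity (5.23) with $\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}_1$ gives $g(\hat{\boldsymbol{\beta}}_1) \ge g(\hat{\boldsymbol{\beta}})$.
Thus, $g(\hat{\boldsymbol{\beta}}_1) = g(\hat{\boldsymbol{\beta}})$, and using (5.23) again, we get

$$\big(\mathbf{X}(\hat{\boldsymbol{\beta}}_1 - \hat{\boldsymbol{\beta}}), \mathbf{X}(\hat{\boldsymbol{\beta}}_1 - \hat{\boldsymbol{\beta}})\big) = 0 \tag{5.29}$$

so that

$$\mathbf{X}(\hat{\boldsymbol{\beta}}_1 - \hat{\boldsymbol{\beta}}) = \mathbf{0}. \tag{5.30}$$

Multiplying both sides of (5.30) by $\mathbf{X}^T$ on the left and using (5.21) again, we get

$$\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}_1 - \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}_1 - \mathbf{X}^T\mathbf{y} = \mathbf{0}. \tag{5.31}$$

Thus, $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}_1 = \mathbf{X}^T\mathbf{y}$ and $\hat{\boldsymbol{\beta}}_1$ satisfies (5.21). Hence, every MLE of $\boldsymbol{\beta}$ is obtained by
solving (5.21).

If $\mathbf{X}^T\mathbf{X}$ has an inverse, then the solution to (5.21) is unique, so that there is a unique
MLE given by (5.22a). ∎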
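
The identity (5.23) that drives this proof can be verified numerically. The Python sketch below is an illustration only: it uses a made-up simple regression data set, solves the $2 \times 2$ normal equations by Cramer's rule in exact rational arithmetic, and then checks for a few arbitrary trial vectors $\boldsymbol{\beta}$ that $g(\boldsymbol{\beta}) - g(\hat{\boldsymbol{\beta}})$ equals the squared length of $\mathbf{X}(\boldsymbol{\beta} - \hat{\boldsymbol{\beta}})$. The function names `g` and `gap` are hypothetical.

```python
from fractions import Fraction as F

# Made-up data for a simple regression (m = 1).
x = [0, 1, 2, 3]
y = [1, 3, 4, 7]
n = len(x)

# Normal equations (5.21) for m = 1, solved by Cramer's rule.
sx, sxx = sum(x), sum(v * v for v in x)
sy, sxy = sum(y), sum(a * b for a, b in zip(x, y))
det = n * sxx - sx * sx
b0_hat = F(sxx * sy - sx * sxy, det)
b1_hat = F(n * sxy - sx * sy, det)


def g(b0, b1):
    """Residual sum of squares (5.20) at beta = (b0, b1)."""
    return sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))


def gap(b0, b1):
    """Right-hand side of (5.23): squared length of X(beta - beta_hat)."""
    return sum(((b0 - b0_hat) + (b1 - b1_hat) * xi) ** 2 for xi in x)


# Arbitrary trial values of beta; the identity should hold for each of them.
trial_points = [(F(0), F(0)), (F(1), F(2)), (F(-3), F(5, 2))]
identity_holds = all(
    g(b0, b1) - g(b0_hat, b1_hat) == gap(b0, b1) for b0, b1 in trial_points
)
```

Because the right-hand side is a sum of squares, no trial $\boldsymbol{\beta}$ can give a smaller residual sum of squares than $\hat{\boldsymbol{\beta}}$, which is exactly the first half of the proof.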


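Since Theorem 5.1 shows that the maximum likelihood estimator $\hat{\boldsymbol{\beta}}$ of $\boldsymbol{\beta}$ can be
obtained by minimizing the residual sum of squares $g(\boldsymbol{\beta}) = (\mathbf{y} - \mathbf{X}\boldsymbol{\beta}, \mathbf{y} - \mathbf{X}\boldsymbol{\beta})$ with
respect to $\boldsymbol{\beta}$ and $g(\boldsymbol{\beta})$ is differentiable, this minimum can be found by the standard
calculus method of solving the simultaneous equations

$$\frac{\partial g(\boldsymbol{\beta})}{\partial \beta_j} = 0, \qquad j = 0, 1, 2, \ldots, m. \tag{5.32}$$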

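Carrying out these differentiations (see Exercise 5.5) we again arrive at the normal
equations $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{y}$. When written out in full, these equations become

$$\begin{pmatrix} n & \sum_k x_{k1} & \cdots & \sum_k x_{km} \\ \sum_k x_{k1} & \sum_k x_{k1}^2 & \cdots & \sum_k x_{k1}x_{km} \\ \vdots & \vdots & & \vdots \\ \sum_k x_{km} & \sum_k x_{km}x_{k1} & \cdots & \sum_k x_{km}^2 \end{pmatrix} \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \\ \vdots \\ \hat\beta_m \end{pmatrix} = \begin{pmatrix} \sum_k y_k \\ \sum_k x_{k1}y_k \\ \vdots \\ \sum_k x_{km}y_k \end{pmatrix} \tag{5.33}$$

where all sums run over $k = 1, \ldots, n$. To see this, we write $\mathbf{X}$ in partitioned form as

$$\mathbf{X} = [\mathbf{1} \,|\, \mathbf{x}_1 \,|\, \cdots \,|\, \mathbf{x}_m] \tag{5.34}$$

where $\mathbf{1} = (1, 1, \ldots, 1)^T$ and $\mathbf{x}_j$ now denotes the column of $\mathbf{X}$ holding the $j$-th variable, so that

$$\mathbf{X}^T = \begin{pmatrix} \mathbf{1}^T \\ \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_m^T \end{pmatrix} \tag{5.35}$$

and using the rules for multiplying partitioned matrices

$$\mathbf{X}^T\mathbf{X} = \begin{pmatrix} \mathbf{1}^T \\ \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_m^T \end{pmatrix} [\mathbf{1} \,|\, \mathbf{x}_1 \,|\, \cdots \,|\, \mathbf{x}_m] = \begin{pmatrix} (\mathbf{1}, \mathbf{1}) & (\mathbf{1}, \mathbf{x}_1) & \cdots & (\mathbf{1}, \mathbf{x}_m) \\ (\mathbf{x}_1, \mathbf{1}) & (\mathbf{x}_1, \mathbf{x}_1) & \cdots & (\mathbf{x}_1, \mathbf{x}_m) \\ \vdots & \vdots & & \vdots \\ (\mathbf{x}_m, \mathbf{1}) & (\mathbf{x}_m, \mathbf{x}_1) & \cdots & (\mathbf{x}_m, \mathbf{x}_m) \end{pmatrix}. \tag{5.36}$$

Similarly,

$$\mathbf{X}^T\mathbf{y} = \begin{pmatrix} \mathbf{1}^T \\ \mathbf{x}_1^T \\ \vdots \\ \mathbf{x}_m^T \end{pmatrix} \mathbf{y} = \begin{pmatrix} (\mathbf{1}, \mathbf{y}) \\ (\mathbf{x}_1, \mathbf{y}) \\ \vdots \\ (\mathbf{x}_m, \mathbf{y}) \end{pmatrix} \tag{5.37}$$

and writing (5.36)-(5.37) in components gives (5.33).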

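Theorem 5.1 shows that a maximum likelihood estimator of $\boldsymbol{\beta}$ must be a solution to
the normal equations (5.21) (also called the least squares equations). However, this does
not guarantee that the MLE exists, unless we show one of two things:


(i) Equation (5.21) always has at least one solution.


(ii) There exists a vector $\boldsymbol{\beta}$ minimizing $g(\boldsymbol{\beta})$.


Of course if $(\mathbf{X}^T\mathbf{X})^{-1}$ exists, we are done, and this is the case we shall consider in
most of our work. However, for completeness, we will show that (i) is true even if this is
not the case. For this we need the following fact from linear algebra, stated as Lemma
5.1 without proof.


Lemma 5.1 The set of linear equations $A\mathbf{x} = \mathbf{y}$ has a solution if and only if
$(\mathbf{y}, \mathbf{z}) = 0$ for every $\mathbf{z}$ such that $A^T\mathbf{z} = \mathbf{0}$.


Theorem 5.2 The least-squares equations (5.21) always have at least one solution.


Proof. From Lemma 5.1, applied with the symmetric matrix $A = \mathbf{X}^T\mathbf{X}$, we must show that $(\mathbf{X}^T\mathbf{y}, \mathbf{z}) = 0$ for every $\mathbf{z}$ such that
$\mathbf{X}^T\mathbf{X}\mathbf{z} = \mathbf{0}$. Now if $\mathbf{X}^T\mathbf{X}\mathbf{z} = \mathbf{0}$, then $(\mathbf{X}^T\mathbf{X}\mathbf{z}, \mathbf{z}) = (\mathbf{X}\mathbf{z}, \mathbf{X}\mathbf{z}) = 0$, so that $\mathbf{X}\mathbf{z} = \mathbf{0}$. Thus,

$$(\mathbf{X}^T\mathbf{y}, \mathbf{z}) = (\mathbf{y}, \mathbf{X}\mathbf{z}) = 0 \tag{5.38}$$

and the theorem is proved. ∎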


It now follows from Theorems 5.1 and 5.2 that the MLE of $\boldsymbol{\beta}$ can be found by solving
a set of $m + 1$ linear equations in $m + 1$ unknowns. Since there are relatively few situations
where such solutions can be found analytically, in virtually all cases numerical techniques
need to be used. This in itself is a problem of considerable complexity, since, as pointed
out in Chapter 4, well known techniques such as Gaussian elimination may run into
difficulty [112].
As we shall see, experimental data frequently lead to $\mathbf{X}^T\mathbf{X}$ being highly ill-conditioned,
which can seriously affect the numerical accuracy of the computed value of $\hat{\boldsymbol{\beta}}$. Hence, in
recent years, regression calculations are often made using techniques such as Cholesky
factorization, the QR decomposition or the SVD. We shall return to this matter in Chapter 9.

Despite the fact that modern statistical packages, such as SAS, SPSS and MINITAB,
use these techniques and are generally believed to be reliable, one should be cautious in
uncritically using these "black boxes" without examining the output for possible computational
glitches. In searching the internet recently, we came across numerous questions
concerning numerical results which appeared to be in error. In general, it is wise to check
for things such as the signs of $\hat\beta_i$, $0 \le i \le m$, and their magnitude, since ill-conditioning
can cause even sophisticated methods to fail. In this regard, one can perform a simple
test of accuracy by adding small errors to the data, running the regression again and
comparing the result to the original output. Small changes in $\hat\beta_i$, $0 \le i \le m$, suggest reliability
of the software.
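
The perturbation test described above is easy to automate. The following Python sketch is a minimal illustration under stated assumptions: it uses a simple regression line as a stand-in for a full regression run, perturbs only the responses, and draws the perturbations from a seeded uniform distribution; the function names, the noise scale `1e-3` and the data are all made up for the example.

```python
import random


def fit_line(x, y):
    """Least squares intercept and slope for simple linear regression."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


def perturbation_check(x, y, scale=1e-3, seed=0):
    """Refit after adding small errors to y; return the absolute changes
    in the intercept and in the slope."""
    rng = random.Random(seed)
    y_noisy = [yi + rng.uniform(-scale, scale) for yi in y]
    b0, b1 = fit_line(x, y)
    c0, c1 = fit_line(x, y_noisy)
    return abs(c0 - b0), abs(c1 - b1)


# Made-up, well-conditioned data: ten equally spaced x values.
x = [float(i) for i in range(10)]
y = [1.0 + 2.0 * xi for xi in x]

change_in_intercept, change_in_slope = perturbation_check(x, y)
```

For well-conditioned data such as these, both changes stay on the order of the perturbation itself. Changes that are orders of magnitude larger than the added errors are a warning sign of ill-conditioning rather than proof of a software fault.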

As for simple linear regression, we can estimate $\boldsymbol{\beta}$ by minimizing $g(\boldsymbol{\beta})$ regardless of
the nature of the errors. In this case $\hat{\boldsymbol{\beta}}$ is called the ordinary least squares estimate (or
just least squares estimate of $\boldsymbol{\beta}$) and is often abbreviated to OLS. Currently, this is the
most common method for estimating $\boldsymbol{\beta}$ in the GLM, although considerable research has
taken place in the last two decades on alternative approaches [27, 87]. Some of these
procedures are discussed in later chapters.

Since the normal equations always have at least one solution, it is important to be
able to identify when this solution is unique, without necessarily having to form $\mathbf{X}^T\mathbf{X}$.
It turns out that we need only examine the columns of $\mathbf{X}$.


Theorem 5.3 $\mathbf{X}^T\mathbf{X}$ is nonsingular if and only if the columns of $\mathbf{X}$ are linearly
independent. (In this case (5.16) is called a full rank model.)


Proof. We will first show that the linear independence of the columns of $\mathbf{X}$ implies
the nonsingularity of $\mathbf{X}^T\mathbf{X}$. Suppose this is not the case. Then by Theorem 4.2 there is
a non-zero vector $\mathbf{c}$ such that

$$\mathbf{X}^T\mathbf{X}\mathbf{c} = \mathbf{0}. \tag{5.39}$$

Thus, taking the dot product on both sides of (5.39) with $\mathbf{c}$ we get

$$(\mathbf{c}, \mathbf{X}^T\mathbf{X}\mathbf{c}) = (\mathbf{X}\mathbf{c}, \mathbf{X}\mathbf{c}) = 0. \tag{5.40}$$

But (5.40) says that $\mathbf{X}\mathbf{c} = \mathbf{0}$ and writing this out in full shows that

$$\sum_{i=0}^{m} c_i \mathbf{x}_i = \mathbf{0} \tag{5.41}$$

where $\mathbf{c} = (c_0, c_1, \ldots, c_m)^T$ and $\mathbf{x}_i$ is now the $i$-th column of $\mathbf{X}$, counting from $i = 0$. This shows that the
columns of $\mathbf{X}$ are linearly dependent, which contradicts our assumption about $\mathbf{X}$. Thus
$\mathbf{X}^T\mathbf{X}$ is nonsingular.

On the other hand, suppose that $\mathbf{X}^T\mathbf{X}$ is nonsingular but the columns of $\mathbf{X}$ are
linearly dependent. Thus there exists a non-zero vector $\mathbf{c}$ such that

$$\mathbf{X}\mathbf{c} = \mathbf{0}. \tag{5.42}$$

Multiplying both sides of (5.42) by $\mathbf{X}^T$ gives

$$\mathbf{X}^T\mathbf{X}\mathbf{c} = \mathbf{0} \tag{5.43}$$

and this shows that $\mathbf{X}^T\mathbf{X}$ is singular. Again we arrive at a contradiction and so the
columns of $\mathbf{X}$ must be linearly independent. ∎


For some types of problems, linear independence of the columns is easily checked 
a priori, and ideally should be done before using a statistical package. However, most 
modern packages are reasonably “idiot-proof” and will inform the user if a singular matrix 
is encountered during the process of computation. In most cases the program will stop 
execution with the output of an appropriate error message. 
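
A check of this kind can be sketched in a few lines. The Python below is an illustration, not the method of any particular package: it forms $\mathbf{X}^T\mathbf{X}$ for two made-up design matrices and computes its determinant exactly with rational arithmetic, so that a singular matrix is detected as an exact zero. The helper names `gram_matrix` and `determinant` are hypothetical.

```python
from fractions import Fraction


def gram_matrix(X):
    """Return X^T X for a design matrix X given as a list of rows."""
    p = len(X[0])
    return [[sum(row[a] * row[b] for row in X) for b in range(p)] for a in range(p)]


def determinant(A):
    """Exact determinant by Gaussian elimination with row exchanges."""
    n = len(A)
    M = [[Fraction(v) for v in row] for row in A]
    det = Fraction(1)
    for col in range(n):
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            return Fraction(0)  # no pivot in this column: the matrix is singular
        if pivot != col:
            M[col], M[pivot] = M[pivot], M[col]
            det = -det  # a row exchange flips the sign
        det *= M[col][col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            M[r] = [vr - factor * vc for vr, vc in zip(M[r], M[col])]
    return det


# Columns are (1, x_1, x_2). In X_deficient the third column is the sum of
# the first two, so the columns are linearly dependent.
X_full = [[1, 0, 1], [1, 1, 0], [1, 2, 2], [1, 3, 1]]
X_deficient = [[1, 0, 1], [1, 1, 2], [1, 2, 3], [1, 3, 4]]

det_full = determinant(gram_matrix(X_full))            # non-zero
det_deficient = determinant(gram_matrix(X_deficient))  # exactly zero
```

In floating point arithmetic an exact zero is rarely observed, which is one reason practical software relies on rank-revealing factorizations rather than determinants.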

Although it is certainly possible for $\mathbf{X}^T\mathbf{X}$ to be singular, what is a more frequent
occurrence, and more difficult to remedy, is the possibility that the columns of $\mathbf{X}$ may
be "approximately" linearly dependent. Roughly speaking this means that there is at
least one column, say the $j$-th, which is an approximate linear combination of a subset
of the remaining ones. That is

$$\mathbf{x}_j = \sum_{l=1}^{k} c_l \mathbf{x}_{i_l} + \boldsymbol{\delta} \tag{5.44}$$

where $\{i_l\}_{l=1}^{k}$ is a subset of $\{0, 1, 2, \ldots, j-1, j+1, \ldots, m\}$ and $\boldsymbol{\delta}$ is a "small" vector. (More
precise definitions are given in Belsley, Kuh and Welsch (BKW) [8] and will be discussed
in Chapter 9.) Relation (5.44) suggests that $\mathbf{x}_j$ may be superfluous in explaining $\mathbf{Y}$
once $\{\mathbf{x}_{i_l}\}_{l=1}^{k}$ have been included in the model. This problem of multicollinearity, as it
is frequently called, can make it difficult to calculate $\hat{\boldsymbol{\beta}}$ numerically and to interpret the
statistical significance of the regression coefficients. Numerical examples illustrating this
problem will be given later in the chapter.
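
A tiny constructed example shows how severe the effect of relation (5.44) can be. The Python sketch below is an illustration under simplifying assumptions that are not from the text: the model has no intercept, there are only three observations, and the second regressor differs from the first by $\boldsymbol{\delta} = (0, 0, d)^T$ with $d = 0.001$. Exact rational arithmetic is used so that the effect cannot be blamed on rounding.

```python
from fractions import Fraction as F


def fit_two_regressors(x1, x2, y):
    """Least squares fit of y = b1*x1 + b2*x2 (no intercept), obtained by
    solving the 2x2 normal equations with Cramer's rule."""
    s11 = sum(a * a for a in x1)
    s22 = sum(b * b for b in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * c for a, c in zip(x1, y))
    s2y = sum(b * c for b, c in zip(x2, y))
    det = s11 * s22 - s12 * s12
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


d = F(1, 1000)
x1 = [F(1), F(2), F(3)]
x2 = [F(1), F(2), F(3) + d]          # x2 = x1 + (0, 0, d): nearly collinear
y = [a + b for a, b in zip(x1, x2)]  # generated with b1 = b2 = 1, no error

# Change a single response by 0.001 and refit.
y_perturbed = y[:2] + [y[2] + F(1, 1000)]

original = fit_two_regressors(x1, x2, y)             # equals (1, 1)
perturbed = fit_two_regressors(x1, x2, y_perturbed)  # equals (0, 2)
```

A change of 0.001 in one observation moves the estimates from $(1, 1)$ to $(0, 2)$, although their sum, which is what the nearly identical columns jointly determine, stays at 2.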


5.3.2 Some Analytical and Numerical Solutions of the Normal 
Equations 


As we have already seen, even for simple linear regression, practical applications of re- 
gression analysis require a computer in order to do the calculations. However, there are 
a number of cases where closed form solutions can be obtained to the normal equations. 
Although some of these solutions may be used infrequently as practical computing tools, 
they are often quite useful in a variety of theoretical analyses of the regression problem. 


Analytical Solutions

As our first example of an analytical solution to the normal equations we will rederive
the least squares estimators for the simple linear regression model.


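Example 5.6 Since we have only one regressor in this case, we will let $x_1 = x$ to
make our results consistent with the notation used in Chapter 3. Thus, (3.1) of Chapter
3 can be written in vector-matrix form as

$$\begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \beta_0 \\ \beta_1 \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix} \tag{5.45}$$

or equivalently as in (3.4)

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\varepsilon}. \tag{5.46}$$

The least squares equations then become

$$\begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix} \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ x_1 & x_2 & \cdots & x_n \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}, \tag{5.47}$$

that is,

$$\begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix} \begin{pmatrix} \hat\beta_0 \\ \hat\beta_1 \end{pmatrix} = \begin{pmatrix} \sum_{i=1}^{n} y_i \\ \sum_{i=1}^{n} x_i y_i \end{pmatrix} \tag{5.48}$$

or equivalently

$$\begin{aligned} n\hat\beta_0 + \hat\beta_1 \sum_{i=1}^{n} x_i &= \sum_{i=1}^{n} y_i \\ \hat\beta_0 \sum_{i=1}^{n} x_i + \hat\beta_1 \sum_{i=1}^{n} x_i^2 &= \sum_{i=1}^{n} x_i y_i \end{aligned} \tag{5.49}$$

which are clearly the same as equations (3.11) and (3.14) in Chapter 3.

For these equations to have a unique solution the columns of $\mathbf{X}$ must be linearly
independent and for this to be true at least two elements of $(x_1, x_2, \ldots, x_n)$ must be
different. This is equivalent to having $S_{xx} \ne 0$. In this case (5.49) can be solved as in
Chapter 3 to yield (3.18)-(3.19).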


Example 5.7 (Centered variables) Although the normal equations for the simple 
linear regression equation can be solved explicitly in a fairly simple form, analytical solu- 
tions obtained by standard algebraic methods such as Cramer’s rule become increasingly 
complex as the number of variables increases. In this regard, we introduce a simple trick 
which reduces the dimensionality of the system of normal equations by one. In particu- 
lar, a regression problem with three variables can be solved using only two, rather than 
three simultaneous equations. In that case, the analytical solution becomes as tractable 
as that for the simple regression model. 

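To proceed, let

$$\bar{x}_j = \frac{1}{n} \sum_{i=1}^{n} x_{ij}, \qquad 1 \le j \le m, \tag{5.50}$$

denote the mean of the values of the $j$-th variable, that is, of the corresponding column of $\mathbf{X}$. Then by subtracting $\bar{y}$
from both sides of (5.11) and adding and subtracting $\beta_j \bar{x}_j$ on the right, it becomes

$$Y_i - \bar{y} = \beta_0 - \bar{y} + \sum_{j=1}^{m} \beta_j \bar{x}_j + \sum_{j=1}^{m} (x_{ij} - \bar{x}_j)\beta_j + \varepsilon_i, \qquad 1 \le i \le n. \tag{5.51}$$

In particular, if $m = 1$, then

$$\begin{aligned} Y_i - \bar{y} &= \beta_0 - \bar{y} + \beta_1 \bar{x} + (x_i - \bar{x})\beta_1 + \varepsilon_i \\ &= \beta_{0c} + \beta_{1c}(x_i - \bar{x}) + \varepsilon_i \end{aligned} \tag{5.52}$$

where, as before, we denote $x_{i1}$ by $x_i$.

In this case it is easy to find the least squares estimates of $(\beta_{0c}, \beta_{1c})$ and relate them
to the least squares estimates of $(\beta_0, \beta_1)$ in the original model. If we let $y_{ic} = y_i - \bar{y}$,
$x_{ic} = x_i - \bar{x}$, it follows from (3.19) in Chapter 3 that

$$\hat\beta_{0c} = \bar{y}_c - \bar{x}_c \hat\beta_{1c} = 0 \tag{5.53}$$

since $\bar{y}_c = \bar{x}_c = 0$. Also,

$$\hat\beta_{1c} = \frac{\sum_{i=1}^{n} y_{ic}(x_{ic} - \bar{x}_c)}{\sum_{i=1}^{n} (x_{ic} - \bar{x}_c)^2} = \frac{\sum_{i=1}^{n} (y_i - \bar{y})(x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \hat\beta_1. \tag{5.54}$$

Since $\beta_{0c} = \beta_0 - \bar{y} + \beta_1 \bar{x}$, we can obtain the least squares estimate of $\beta_0$ by setting

$$\hat\beta_0 = \hat\beta_{0c} + \bar{y} - \hat\beta_1 \bar{x} = \bar{y} - \hat\beta_1 \bar{x}. \tag{5.55}$$

From these calculations we obtain the following procedure to obtain the least squares
estimates of $(\beta_0, \beta_1)$ in the simple linear regression model.

(i) Form the centered variables: $y_{ic} = y_i - \bar{y}$ and $x_{ic} = x_i - \bar{x}$.
(ii) Regress $y_{ic}$ on $x_{ic}$ to obtain the least squares estimate $\hat\beta_1$ of $\beta_1$.
(iii) The least squares estimate $\hat\beta_0$ of $\beta_0$ is then obtained from (5.55).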
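
Steps (i)-(iii) translate directly into code. The Python sketch below is a minimal illustration on made-up data; the function name `centered_simple_regression` is hypothetical, and exact fractions are used only so that the results can be compared with hand calculation.

```python
from fractions import Fraction


def centered_simple_regression(x, y):
    """Follow steps (i)-(iii): center, regress through the origin, then
    recover the intercept from (5.55)."""
    n = len(x)
    xbar = Fraction(sum(x), n)
    ybar = Fraction(sum(y), n)
    # (i) centered variables
    xc = [xi - xbar for xi in x]
    yc = [yi - ybar for yi in y]
    # (ii) slope from the centered data; no intercept is needed, see (5.53)
    b1 = sum(a * b for a, b in zip(xc, yc)) / sum(a * a for a in xc)
    # (iii) intercept of the original model, equation (5.55)
    b0 = ybar - b1 * xbar
    return b0, b1


# Made-up data.
x = [1, 2, 3, 4]
y = [2, 3, 5, 6]

b0_hat, b1_hat = centered_simple_regression(x, y)  # 1/2 and 7/5
```

Only one equation in one unknown is solved in step (ii), which is the reduction in dimension that centering provides.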


Using some tedious algebra resulting from the representation (5.51) we can obtain
the least squares estimates of $\boldsymbol{\beta}$ in the GLM in a way similar to that for $m = 1$.

(i') Find the least squares estimates of $(\beta_1, \beta_2, \ldots, \beta_m)$ by solving the modified least
squares equations

$$\left[\mathbf{X}_c^T \mathbf{X}_c\right] \begin{pmatrix} \hat\beta_1 \\ \hat\beta_2 \\ \vdots \\ \hat\beta_m \end{pmatrix} = \mathbf{X}_c^T \mathbf{y}_c, \tag{5.56}$$

where $[\mathbf{X}_c]_{ij} = [x_{ij} - \bar{x}_j]$, $1 \le i \le n$, $1 \le j \le m$, is the centered design matrix and

$$\mathbf{y}_c = (y_1 - \bar{y}, y_2 - \bar{y}, \ldots, y_n - \bar{y})^T \tag{5.57}$$

is the vector of centered observations.

(ii') $\hat\beta_0$ is then given by

$$\hat\beta_0 = \bar{y} - \sum_{j=1}^{m} \hat\beta_j \bar{x}_j. \tag{5.58}$$

Note that this procedure requires solving $m$ equations, one less than for the full least
squares equations. In particular, the coefficient estimates for a two variable model may
be obtained by solving only two equations in two unknowns, a task generally no more
difficult than for the simple regression model. A numerical example illustrating these
calculations follows.


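Example 5.8 (Centered variables; m = 2) We begin by considering the general
model

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon. \tag{5.59}$$

By centering the variables we can rewrite (5.59) as

$$\begin{aligned} y - \bar{y} &= \beta_0 - \bar{y} + \beta_1 (x_1 - \bar{x}_1) + \beta_1 \bar{x}_1 + \beta_2 (x_2 - \bar{x}_2) + \beta_2 \bar{x}_2 + \varepsilon \\ &= \beta_0 - \bar{y} + \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2 + \beta_1 x_{1c} + \beta_2 x_{2c} + \varepsilon \end{aligned} \tag{5.60}$$

where $y_c = y - \bar{y}$, $x_{1c} = x_1 - \bar{x}_1$ and $x_{2c} = x_2 - \bar{x}_2$. In this form the model becomes

$$Y_c = \beta_{0c} + \beta_{1c} x_{1c} + \beta_{2c} x_{2c} + \varepsilon \tag{5.61}$$

where

$$\beta_{0c} = \beta_0 - \bar{y} + \beta_1 \bar{x}_1 + \beta_2 \bar{x}_2. \tag{5.62}$$

One can now estimate $(\beta_{0c}, \beta_{1c}, \beta_{2c})$ by least squares. As for the case of simple linear
regression it can be shown that the least squares estimates of $(\beta_{0c}, \beta_{1c}, \beta_{2c})^T$ are

$$\hat\beta_{0c} = 0, \qquad \hat\beta_{1c} = \hat\beta_1, \qquad \hat\beta_{2c} = \hat\beta_2 \tag{5.63}$$

where $\hat\beta_{ic}$ are the least squares estimates of $\beta_{ic}$, $i = 1, 2$, from (5.61) and $\hat\beta_i$, $i = 1, 2$, are
the usual least squares estimates. Thus, it suffices to start with the model

$$Y_c = \beta_1 x_{1c} + \beta_2 x_{2c} + \varepsilon, \tag{5.64}$$

estimate $(\beta_1, \beta_2)^T$ by least squares and then estimate $\beta_0$ in (5.59) by

$$\hat\beta_0 = \bar{y} - \hat\beta_1 \bar{x}_1 - \hat\beta_2 \bar{x}_2. \tag{5.65}$$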


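This is a simplification because now only two simultaneous equations have to be solved
to estimate $(\beta_0, \beta_1, \beta_2)$. To estimate $(\beta_1, \beta_2)^T$ we use (5.56) where

$$\mathbf{X}_c = \begin{pmatrix} x_{11c} & x_{12c} \\ x_{21c} & x_{22c} \\ \vdots & \vdots \\ x_{n1c} & x_{n2c} \end{pmatrix} \tag{5.66}$$

so that (dropping the c's for convenience)

$$\mathbf{X}^T\mathbf{X} = \begin{pmatrix} \sum x_1^2 & \sum x_1 x_2 \\ \sum x_1 x_2 & \sum x_2^2 \end{pmatrix}. \tag{5.67}$$

Introducing the shorthand notation $\sum x_1^2 = \sum_{i=1}^{n} x_{i1}^2$, $\sum x_1 x_2 = \sum_{i=1}^{n} x_{i1} x_{i2}$ etc.,
Eq. (5.56) can be solved using Cramer's rule to give

$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} = \frac{1}{\sum x_1^2 \sum x_2^2 - \left(\sum x_1 x_2\right)^2} \begin{pmatrix} \left(\sum x_2^2\right)\left(\sum x_1 y\right) - \left(\sum x_1 x_2\right)\left(\sum x_2 y\right) \\ \left(\sum x_1^2\right)\left(\sum x_2 y\right) - \left(\sum x_1 x_2\right)\left(\sum x_1 y\right) \end{pmatrix}. \tag{5.68}$$

Thus,

$$\hat\beta_1 = \frac{\left(\sum x_2^2\right)\left(\sum x_1 y\right) - \left(\sum x_1 x_2\right)\left(\sum x_2 y\right)}{\sum x_1^2 \sum x_2^2 - \left(\sum x_1 x_2\right)^2} \tag{5.69}$$

$$\hat\beta_2 = \frac{\left(\sum x_1^2\right)\left(\sum x_2 y\right) - \left(\sum x_1 x_2\right)\left(\sum x_1 y\right)}{\sum x_1^2 \sum x_2^2 - \left(\sum x_1 x_2\right)^2} \tag{5.70}$$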


321323 - rr) 
Using (5.69)-(5.70) we now consider fitting the model

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$$

to the following data:

   x_1    x_2      y
    -5      5     11
    -4      4     11
    -1      1      8
     2     -3      2
     2     -2      5
     3     -2      5
     3     -3      4

$\bar{x}_1 = 0$, $\bar{x}_2 = 0$, $\bar{y} = 46/7$

To do the fitting using (5.69)-(5.70) we first compute the centered data. This is given
by

  x_1c   x_2c     y_c
    -5      5    31/7
    -4      4    31/7
    -1      1    10/7
     2     -3   -32/7
     2     -2   -11/7
     3     -2   -11/7
     3     -3   -18/7

$\sum x_{1c} = 0$, $\sum x_{2c} = 0$, $\sum y_c = 0$
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We now compute the various sums of squares and cross products required in Eqs. (5.68)- 
(5.70). 


Tie T2c TlcT2c Ticy T2cy 
25 25 —25 —155/7 155/7 
16 16 —16 —124/7 124/7 
1 1 =l —10/7 10/7 
4 9 —6 —64/7 96/7 
4 4 E 25217 22/7 
9 4 —6 —33/7 22/7 
9 9 ~9 —54/7 54/7 


SS A a ЕЕ ИНЕ АНЕ a a e O a ve ^ nd 
У\22 = 68 9222-68 Priz: = -67 P riy 2 —— D ry = —— 


From the data, (5.62) and (5.69)-(5.70) we get >> x? Y 23 — ($2123) = 68? — 67? = 
135, and 


х —68 (462/7) + 67 (483/7) 
By = О 135 7 = 1, 
: —67 (462/7) + 68 (483/7) 
Bo = ^ —— 185 =2 


and А К E 
Bo = y => Вії = д2 = 46/7. 
Finally, we note that although using centered variables can simplify computations, its 


primary use these days is to improve the conditioning of the normal equations for better 
numerical accuracy and statistical interpretation. 
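
The arithmetic of Example 5.8 can be checked with a short script. The following is a minimal sketch in Python with NumPy; the function name `fit_two_regressors_centered` is our own and is not part of the text. It centers the data, forms the sums of squares and cross products, and applies the Cramer's rule formulas (5.69)-(5.70) and then (5.65).

```python
import numpy as np

# Data of Example 5.8.
x1 = np.array([-5, -4, -1, 2, 2, 3, 3], dtype=float)
x2 = np.array([5, 4, 1, -3, -2, -2, -3], dtype=float)
y = np.array([11, 11, 8, 2, 5, 5, 4], dtype=float)


def fit_two_regressors_centered(x1, x2, y):
    """Least squares fit of y = b0 + b1*x1 + b2*x2 via centered variables."""
    x1c, x2c, yc = x1 - x1.mean(), x2 - x2.mean(), y - y.mean()
    s11, s22, s12 = x1c @ x1c, x2c @ x2c, x1c @ x2c   # sums of squares / products
    s1y, s2y = x1c @ yc, x2c @ yc
    det = s11 * s22 - s12**2                          # determinant of X_c^T X_c
    b1 = (s22 * s1y - s12 * s2y) / det                # (5.69)
    b2 = (s11 * s2y - s12 * s1y) / det                # (5.70)
    b0 = y.mean() - b1 * x1.mean() - b2 * x2.mean()   # (5.65)
    return b0, b1, b2


b0, b1, b2 = fit_two_regressors_centered(x1, x2, y)
print(b0, b1, b2)
```

Only a 2 x 2 determinant is needed, which is the simplification the example describes.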


Example 5.9 (Centered and scaled variables) Sometimes, in addition to centering
the regressor variables, it is helpful to scale them as well. The most commonly used
scaling is one in which each column of $X_c$ has unit Euclidean length. That
is, if $x_j$ is the j-th column of $X_c$, then each element of $x_j$ is divided by

$s_j = \left[ \sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2 \right]^{1/2}.$ (5.71)

If $X_{sc}$ denotes the centered and scaled design matrix, then
$X_{sc} = X_c S$ (5.72)
where $S$ is the diagonal matrix $S = \mathrm{diag}(1/s_1, 1/s_2, \ldots, 1/s_m)$.
The centered and scaled version of the model is
$Y_c = X_{sc} \beta_{sc} + \varepsilon, \quad Y_c = Y - \bar{Y} \mathbf{1}$ (5.73)
and the least squares estimate of $\beta_{sc}$ is given by

$\hat{\beta}_{sc} = (X_{sc}^T X_{sc})^{-1} X_{sc}^T y_c.$ (5.74)

Using (5.74) and the fact that the OLS estimates of $\beta_i$, $1 \le i \le m$, are obtained by
solving the centered variable equations, it follows that

$\hat{\beta}_c = S \hat{\beta}_{sc}.$ (5.75)

To see (5.75) use (5.72) and (5.74) to give
$\hat{\beta}_{sc} = \left[ (X_c S)^T X_c S \right]^{-1} (X_c S)^T y_c.$ (5.76)
Since $S$ is diagonal, $S^T = S$, so that $(X_c S)^T = S X_c^T$ and
$\left[ (X_c S)^T X_c S \right]^{-1} = \left[ S X_c^T X_c S \right]^{-1} = S^{-1} (X_c^T X_c)^{-1} S^{-1}.$ (5.77)
Thus,
$\hat{\beta}_{sc} = S^{-1} (X_c^T X_c)^{-1} S^{-1} S X_c^T y_c = S^{-1} (X_c^T X_c)^{-1} X_c^T y_c = S^{-1} \hat{\beta}_c.$ (5.78)

Hence, $\hat{\beta}_c = S \hat{\beta}_{sc}$.
This leads to the following algorithm for computing $\hat{\beta}$ using centered and scaled
variables:

(i) Form $X_{sc}$;

(ii) Estimate $\beta_{sc}$ from (5.74);

(iii) Calculate $\hat{\beta}_c$ by multiplying the i-th component of $\hat{\beta}_{sc}$ by $1/s_i$, $i = 1, 2, \ldots, m$;

(iv) Obtain $\hat{\beta}_0$ from (5.58).


Again, to reduce problems of ill-conditioning, many computer programs perform
calculations using $X_{sc}$. Because it follows from (5.71) and (5.72) that $X_{sc}^T X_{sc}$ is the matrix of
sample correlation coefficients of the columns of $X$, one often says that the regression is
done in correlation form [27, 87].

If one scales $y_c$ as well (denoted by $y_{sc}$), the components of the solution $b$ of

$(X_{sc}^T X_{sc}) b = X_{sc}^T y_{sc}$ (5.79)

are sometimes referred to as beta weights [87].
Further discussion of using regression in correlation form will be given in Chapter 9.
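
The four-step algorithm of Example 5.9 can be sketched as follows in Python with NumPy. The function name `fit_centered_scaled` is hypothetical, and step (iv) assumes that (5.58) is the usual intercept formula $\hat{\beta}_0 = \bar{y} - \sum_j \hat{\beta}_j \bar{x}_j$, which is not shown in this section. The data are those of Table 5.2, which lie exactly on the plane $y = 8 - 5x_1 + 12x_2$.

```python
import numpy as np


def fit_centered_scaled(X, y):
    """X holds the m regressors (no column of ones); returns (b0, b, Xsc)."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                                    # centered design matrix
    s = np.sqrt((Xc**2).sum(axis=0))                  # column lengths s_j, (5.71)
    Xsc = Xc / s                                      # step (i): X_sc = X_c S, (5.72)
    yc = y - y.mean()
    b_sc = np.linalg.solve(Xsc.T @ Xsc, Xsc.T @ yc)   # step (ii), (5.74)
    b = b_sc / s                                      # step (iii), (5.75)
    b0 = y.mean() - x_bar @ b                         # step (iv), assumed form of (5.58)
    return b0, b, Xsc


# Data of Table 5.2.
X = np.array([[2, 1], [3, 2], [4, 5], [1, 2],
              [5, 6], [6, 4], [7, 3], [8, 4]], dtype=float)
y = np.array([10, 17, 48, 27, 55, 26, 9, 16], dtype=float)

b0, b, Xsc = fit_centered_scaled(X, y)
corr = Xsc.T @ Xsc          # sample correlation matrix of the columns of X
print(b0, b)
```

The matrix `corr` has unit diagonal, which is why the system solved in step (ii) tends to be better conditioned than the raw normal equations.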


Example 5.10 (Orthogonal variables) When $X$ has special structure it is often
possible to obtain an analytic solution to the least squares equations. A particularly
important example occurs when the columns $x_j$, $0 \le j \le m$ (for notational convenience,
we denote the column of 1's in (5.15) by $x_0$) are orthogonal; i.e., $(x_i, x_j) = 0$, $i \ne j$. Then
using (5.36), $X^T X$ is the diagonal matrix $\mathrm{diag}((x_0, x_0), (x_1, x_1), \ldots, (x_m, x_m))$. Then,

$(X^T X)^{-1} = \mathrm{diag}\left( \dfrac{1}{(x_0, x_0)}, \dfrac{1}{(x_1, x_1)}, \ldots, \dfrac{1}{(x_m, x_m)} \right) = \mathrm{diag}(\delta_0, \delta_1, \ldots, \delta_m).$ (5.80)

Thus,
$\hat{\beta} = (X^T X)^{-1} X^T y = \begin{pmatrix} \delta_0 & 0 & \cdots & 0 \\ 0 & \delta_1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \delta_m \end{pmatrix} \begin{pmatrix} (x_0, y) \\ (x_1, y) \\ \vdots \\ (x_m, y) \end{pmatrix}$ (5.81)
so that

$\hat{\beta}_i = \dfrac{(x_i, y)}{(x_i, x_i)}, \quad 0 \le i \le m,$ (5.82)
which requires only the formation of a number of inner products and no explicit solution
of linear equations. For this and other reasons it is often desirable to obtain one's data so
that the columns of $X$ are orthogonal (these are usually called orthogonal designs). This
is frequently possible in the case of planned experiments, but is rarely the case for data
that are collected outside of the laboratory. However, the possibility of "orthogonalizing"
$X$ suggests a number of alternative approaches to solving the least squares equations.
When the orthogonalization is done using the Gram-Schmidt process this amounts to
using the QR decomposition of the design matrix $X$.
As a consequence, writing
$X = QR$ (5.83)

as in (4.217), the least squares estimate $\hat{\beta}$ of $\beta$ in (5.16) is given by
$\hat{\beta} = \left[ (QR)^T QR \right]^{-1} (QR)^T y = (R^T Q^T Q R)^{-1} R^T Q^T y.$ (5.84)
Using the fact that $Q$ has orthonormal columns, $Q^T Q = I$, so that
$\hat{\beta} = (R^T R)^{-1} R^T Q^T y = (R^T R)^{-1} R^T z, \quad z = Q^T y.$ (5.85)
Recalling that $R$ is the Cholesky factor of $X^T X$ we can obtain $\hat{\beta}$ by solving
$(R^T R) \hat{\beta} = R^T z.$ (5.86)
Letting $\hat{\alpha} = R \hat{\beta}$, $\hat{\beta}$ can be obtained by solving the two triangular systems
$R^T \hat{\alpha} = R^T z, \quad R \hat{\beta} = \hat{\alpha}$ (5.87)

by forward and backward substitution. (Since $R$ is nonsingular, the first system gives
$\hat{\alpha} = z$, so in practice only the back substitution $R \hat{\beta} = Q^T y$ is required.) As noted in
Chapter 4, since the QR decomposition can be obtained stably by repeatedly applying
orthogonal matrices to $X$, this procedure allows one to compute $\hat{\beta}$ without forming
$X^T X$ or doing matrix inversions. As a consequence, this approach is one of the preferred
computational methods for obtaining the least squares estimate $\hat{\beta}$.
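
A minimal sketch of the QR route in Python with NumPy is given below. The helper `lstsq_qr` is our own name; `np.linalg.qr` returns the reduced factorization, whose $R$ may differ from the Cholesky factor of $X^T X$ by the signs of its rows, which does not affect the solution. The back substitution is written out explicitly to mirror (5.87).

```python
import numpy as np


def lstsq_qr(X, y):
    """Solve the least squares problem via X = QR and back substitution."""
    Q, R = np.linalg.qr(X)        # reduced QR: Q is n x p, R is p x p upper triangular
    z = Q.T @ y                   # z = Q^T y, as in (5.85)
    p = R.shape[0]
    beta = np.zeros(p)
    for i in range(p - 1, -1, -1):            # back substitution for R beta = z
        beta[i] = (z[i] - R[i, i + 1:] @ beta[i + 1:]) / R[i, i]
    return beta


# Data of Table 5.2 with a leading column of ones.
x1 = np.array([2, 3, 4, 1, 5, 6, 7, 8], dtype=float)
x2 = np.array([1, 2, 5, 2, 6, 4, 3, 4], dtype=float)
y = np.array([10, 17, 48, 27, 55, 26, 9, 16], dtype=float)
X = np.column_stack([np.ones_like(y), x1, x2])

beta_qr = lstsq_qr(X, y)
print(beta_qr)
```

Note that $X^T X$ is never formed, so the conditioning of the computation is governed by $X$ rather than by $X^T X$.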


Example 5.11 (The canonical form of the regression model) Another important 
form of the regression model using orthogonality is the canonical form (sometimes called 
the principal components form). 


Since $X^T X$ is a symmetric matrix, by the spectral theorem there exists an orthogonal
matrix $Q$ whose columns are eigenvectors of $X^T X$, i.e.,
$Q^T (X^T X) Q = \Lambda$ (5.88)

where $\Lambda = \mathrm{diag}(\lambda_0, \lambda_1, \ldots, \lambda_m)$ and $\lambda_i$, $0 \le i \le m$, are the eigenvalues of $X^T X$. Letting
$G = XQ$, then (5.88) becomes
$G^T G = \Lambda$ (5.89)

where the columns of $G$ are given by
$\gamma_i = X q_i, \quad 0 \le i \le m,$ (5.90)

where $q_i$, $0 \le i \le m$, are the columns of $Q$. From (5.89) $G^T G$ is the matrix $[(\gamma_i, \gamma_j)]$, $0 \le i, j \le m$,
which is diagonal because $(\gamma_i, \gamma_j) = 0$, $i \ne j$ (i.e., the columns of $G$ are orthogonal).
In fact,

$(\gamma_i, \gamma_j) = (X q_i, X q_j) = (X^T X q_i, q_j) = \lambda_i (q_i, q_j) = 0, \quad i \ne j,$ (5.91)
because $q_i$, $q_j$ are orthonormal eigenvectors of $X^T X$. Also,
$(\gamma_i, \gamma_i) = (X q_i, X q_i) = (X^T X q_i, q_i) = \lambda_i (q_i, q_i) = \lambda_i$ (5.92)

since $(q_i, q_i) = 1$.
The canonical form of the regression model is then given by

$Y = X\beta + \varepsilon = X Q Q^T \beta + \varepsilon = G\chi + \varepsilon, \quad \chi = Q^T \beta,$ (5.93)

and the least squares estimates of $\chi_i$ are given by

$\hat{\chi}_i = (\gamma_i, y) / \lambda_i, \quad 0 \le i \le m,$ (5.94)
as follows from (5.92). Hence,
$\hat{\chi} = \Lambda^{-1} G^T y$ (5.95)
and it follows that
$\hat{\beta} = Q \hat{\chi}.$ (5.96)

To verify this, consider
$X^T X Q \hat{\chi} = X^T X Q \Lambda^{-1} G^T y = X^T X Q \Lambda^{-1} Q^T X^T y = X^T X (X^T X)^{-1} X^T y = X^T y,$ (5.97)

so $Q \hat{\chi}$ is the unique solution to the normal equations. We will find this form of the
regression particularly useful in Chapter 9.
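
The canonical form can be illustrated numerically. The sketch below (Python with NumPy; the function name `lstsq_canonical` and the variable `chi` are our own labels) uses `np.linalg.eigh` to obtain the eigenvalues and orthonormal eigenvectors of $X^T X$, forms $G = XQ$, and recovers $\hat{\beta}$ through (5.94)-(5.96). It is applied to the data of Example 5.8.

```python
import numpy as np


def lstsq_canonical(X, y):
    """Least squares via the eigen-decomposition of X^T X (canonical form)."""
    lam, Q = np.linalg.eigh(X.T @ X)   # eigenvalues and orthonormal eigenvectors
    G = X @ Q                          # columns gamma_i = X q_i, (5.90)
    chi = (G.T @ y) / lam              # chi_i = (gamma_i, y) / lambda_i, (5.94)
    return Q @ chi, lam, G             # beta = Q chi, (5.96)


x1 = np.array([-5, -4, -1, 2, 2, 3, 3], dtype=float)
x2 = np.array([5, 4, 1, -3, -2, -2, -3], dtype=float)
y = np.array([11, 11, 8, 2, 5, 5, 4], dtype=float)
X = np.column_stack([np.ones_like(y), x1, x2])

beta_can, lam, G = lstsq_canonical(X, y)
print(beta_can, lam)
```

Dividing by a very small eigenvalue in (5.94) amplifies noise, which is one way to see why small eigenvalues of $X^T X$ are associated with unstable estimates.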


5.3.3 Numerical Examples 


We now turn our attention to giving a number of practical numerical examples of the use 
of multiple regression analysis. As for simple linear regression, the process of building 
a regression model is an iterative one. Usually one begins by using physical, economic 
or other knowledge to determine appropriate explanatory variables $x_i$, $1 \le i \le m$, for a
given measured response $Y$. To determine if the response varies linearly with $x_i$ one can
then make scatter plots of $y$ against $x_i$; generally these will be of similar appearance to
those for simple linear regression. For those variables which appear to be useful predictors 
a linear model (5.11) is postulated and then fit (at least initially) by least squares. This 
is then followed by looking at goodness of fit, statistical significance, residual plots and 
other diagnostics to determine an adequate final model. 


Example 5.12 (Housing data) To predict housing prices, the data in Table 5.1 were
gathered. A scatter plot, Figure 5.2, of price $y$ in dollars against square footage ($x_1$)
shows a clear upward linear trend.

The plot of price against age ($x_2$) in years, Figure 5.3, while not as precise, shows a
generally decreasing price as the age of a house increases. On the basis of these plots we
consider a linear model

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$

to estimate housing price as a function of square footage and age. The model was fit to
the data in Table 5.1 and the coefficient estimates were

$\hat{\beta}_0 = 13{,}239, \quad \hat{\beta}_1 = 60.589, \quad \hat{\beta}_2 = -1{,}726.8.$
The fitted model is then
$\hat{y} = 13{,}239 + 60.589\, x_1 - 1{,}726.8\, x_2.$ (5.98)


From this we see that housing prices are approximately $60.59 per square foot while 
the price of a house declines at about $1,727 for each year of age. These figures are 
in accordance with the behavior of the scatter plots in Figures 5.2-5.3 and the three- 
dimensional plot in Figure 5.4. 


Table 5.1 Housing Price Data
Obs. No.   Square ft (x1)   Age (x2)   Price (y)
   1          1,800            1        120,000
   2          1,650            7        110,000
   3          2,750           12        150,000
   4          1,550            8         90,000
   5          2,750            0        175,000
   6          1,400           10         65,000
   7          1,250            4         78,000
   8          1,250            0        100,000
   9          2,250           13        131,500
  10          2,300           12        136,100
  11          2,750            0        184,000
  12          2,600            1        164,500
  13          2,850           19        160,000
  14          2,100            4        132,500
  15          1,800            5        117,500


Figure 5.2: Scatter plot of price (y) versus square feet (x1)

Figure 5.3: Scatter plot of price (y) versus age (x2)

Figure 5.4: 3-D scatter plot for the housing data: y versus x1 and x2


Example 5.13 Although most authors in regression analysis generally recommend
using scatter plots in the beginning stages of model building, they are not always reliable.
An interesting example is given by Montgomery, Peck and Vining [87] (see also Daniel
and Wood [22]). They constructed scatter plots for the data given in Table 5.2.


Table 5.2 A Data Set

  x1   x2    y
   2    1   10
   3    2   17
   4    5   48
   1    2   27
   5    6   55
   6    4   26
   7    3    9
   8    4   16

The scatter plots of $y$ against $x_1$ and $x_2$ are shown in Figures 5.5 and 5.6 respectively.
Figure 5.5 shows a general random scatter while Figure 5.6 shows a general linear trend.
To check these impressions we fitted separate least squares lines to each of these sets of
data with the following results. For Figure 5.5

$\hat{\beta}_0 = 28.57, \quad t_0 = 1.99$
$\hat{\beta}_1 = -0.571, \quad t_1 = -0.20$

leading us not to reject the hypothesis that $\beta_1 = 0$, while for Figure 5.6

$\hat{\beta}_0 = -1.34, \quad t_0 = -0.14$
$\hat{\beta}_2 = 8.101, \quad t_2 = 3.23$

leading us to conclude that $\beta_2 \ne 0$.
These results support our visual impression and might lead us to propose

$Y = \beta_0 + \beta_2 x_2 + \varepsilon$

as a reasonable model for these data. To check this we fitted the model
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$
to the data. The results were
$\hat{\beta}_0 = 8.00, \quad \hat{\beta}_1 = -5.00, \quad \hat{\beta}_2 = 12.00$

and the fit was perfect since $R^2 = 1.00$ (see Section 5.6). This is not surprising in light
of the fact that the data in Table 5.2 were obtained by choosing points on the plane

$y = 8 - 5 x_1 + 12 x_2,$

which is a totally different model from that suggested by the one-dimensional scatter plots.

Figure 5.5: Scatter plot of y versus x1 in Ex. 5.13.

Figure 5.6: Scatter plot of y versus x2 in Ex. 5.13.

Although these results indicate that scatter plots can be confusing for understanding
multivariate data, we will continue to use them where appropriate. In Chapter 6 we will
discuss other plots of multivariate data which have behavior similar to that of scatter
plots of univariate data.
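
The contrast in Example 5.13 between the marginal fits and the joint fit is easy to reproduce. The following sketch (Python with NumPy; the helper `ols` is our own thin wrapper around `np.linalg.lstsq`) fits $y$ on $x_1$ alone, on $x_2$ alone, and on both, using the data of Table 5.2.

```python
import numpy as np

# Data of Table 5.2.
x1 = np.array([2, 3, 4, 1, 5, 6, 7, 8], dtype=float)
x2 = np.array([1, 2, 5, 2, 6, 4, 3, 4], dtype=float)
y = np.array([10, 17, 48, 27, 55, 26, 9, 16], dtype=float)


def ols(X, y):
    """Least squares coefficients for the design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]


ones = np.ones_like(y)
b_x1 = ols(np.column_stack([ones, x1]), y)         # y on x1 only
b_x2 = ols(np.column_stack([ones, x2]), y)         # y on x2 only
b_full = ols(np.column_stack([ones, x1, x2]), y)   # y on x1 and x2

print(b_x1, b_x2, b_full)
```

The slope on $x_1$ is close to zero marginally but equals $-5$ once $x_2$ is held fixed, which is exactly the effect the one-dimensional scatter plots hide.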


Example 5.14 (Drink delivery data) In Examples 3.4 and 3.10 we examined the
relation between the time to deliver drinks and the number of cases delivered. Although
a linear model appeared reasonable, it was noted in Example 3.4 that observations 9 and
22 appeared to be outliers and there was some skew to the residuals. This suggests that
perhaps other variables might be useful to improve the model. In this regard, further
data were obtained on the distance (in feet) that the delivery person had to walk. These
data are given in the "feet" column of Table 5.3 along with the data from Table 3.5.


Table 5.3 Drink Delivery data
Obs. No.   Cases (x1)   Feet (x2)   Cases (x1)   Feet (x2)   Time (y)
   1           7           560          6           462        19.75
   2           3           220          9           448        24.00
   3           3           340         10           776        29.00
   4           4            80          6           200        15.35
   5           6           150          7           132        19.00
   6           7           330          3            36         9.50
   7           2           110         17           770        35.10
   8           7           210         10           140        17.90
   9          30          1460         26           810        52.32
  10           5           605          9           450        18.75
  11          16           688          8           635        19.83
  12          10          3215          4           150        10.75
  13           4           255


A scatter plot of time against distance shown in Figure 5.7 suggests a linear trend.
This, in combination with our previous analysis, suggests a linear model

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ (5.99)

for these data. As a consequence the data in Table 5.3 were fit using least squares. The
results were
$\hat{\beta}_0 = 2.341, \quad \hat{\beta}_1 = 1.6159, \quad \hat{\beta}_2 = 0.014385.$

Figure 5.7: Scatter plot of time (y) versus distance (x2)


One should note that the estimate of the slope for the effect of cases has changed 
substantially from its estimate 2.1762 when "distance" was excluded. This suggests that 
the estimate 2.1762 was biased by the omission of “distance.” We shall return to this in 
Chapters 6 and 8. 
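
The change in the "cases" slope when "distance" is added follows an exact algebraic identity for least squares: the slope from the short regression equals the slope from the full regression plus the coefficient of the omitted variable times the slope of the omitted variable regressed on the included one. The sketch below (Python with NumPy) checks this on synthetic data; the numbers are generated purely for illustration and are not the delivery data of Table 5.3.

```python
import numpy as np

rng = np.random.default_rng(0)             # synthetic, illustrative data only
n = 25
x1 = rng.uniform(2, 30, n)                 # stand-in for "cases"
x2 = 30 * x1 + rng.normal(0, 100, n)       # stand-in for "distance", correlated with x1
y = 2.0 + 1.5 * x1 + 0.015 * x2 + rng.normal(0, 3, n)


def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]


ones = np.ones(n)
X_short = np.column_stack([ones, x1])
b_full = ols(np.column_stack([ones, x1, x2]), y)   # both regressors
b_short = ols(X_short, y)                          # x2 omitted
d = ols(X_short, x2)                               # auxiliary regression of x2 on x1

# Short-regression coefficients = full coefficients + d * (coefficient of x2).
b_implied = b_full[:2] + d * b_full[2]
print(b_short, b_implied)
```

When the omitted variable is correlated with the included one (`d[1]` nonzero) and matters for the response (`b_full[2]` nonzero), the short-regression slope absorbs part of its effect.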


Example 5.15 (Birth weight data) To illustrate the use of а dummy variable in 
regression analysis we reconsider the birth weight data discussed in Examples 3.6 and 
3.18. As noted in Example 3.6 Figure 3.9 shows that the residuals break up into two 
groups, the first underpredicting and the second overpredicting - а similar behavior is 
shown in Figure 3.9 where the residuals appear to fall along two parallel lines. Again, 
this suggests a missing variable. When the residuals are labeled by the sex of the child, 
it now appears that accounting for this might improve the model. 

To do this we define a dummy variable

$x_2 = \begin{cases} 1, & \text{if child is male,} \\ 0, & \text{if child is female.} \end{cases}$ (5.100)
We then consider the model
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ (5.101)
to account for the observations. From (5.101) we arrive at two models:

$Y_M = (\beta_0 + \beta_2) + \beta_1 x_1 + \varepsilon$ for males,
$Y_F = \beta_0 + \beta_1 x_1 + \varepsilon$ for females, (5.102)

so that $\beta_2$ is the difference between male and female weights for the same gestation age.
Geometrically, (5.102) represents two parallel lines, with possibly different intercepts. To
account for possibly different slopes we include an interaction term $x_1 x_2$ as in Example
5.4. Then, our expanded model for the birth weight data is

$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \varepsilon, \quad x_3 = x_1 x_2.$ (5.103)

This model was fit by least squares to the data in Table 3.7 where the first 12 children
are males and the last 12 are females. The coefficients are given by

$\hat{\beta}_0 = -2142, \quad \hat{\beta}_1 = 130.4, \quad \hat{\beta}_2 = 1042, \quad \hat{\beta}_3 = -22.73.$
Further analysis will be given later. 
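
To make the role of the dummy variable and the interaction term concrete, the following sketch (Python with NumPy) builds the design matrix for (5.103) and confirms that the single fit is equivalent to fitting a separate line to each group. The data are simulated under assumed parameter values, since Table 3.7 is not reproduced in this section; the helper `line_for` is our own name.

```python
import numpy as np

rng = np.random.default_rng(1)              # hypothetical data, not Table 3.7
x1 = rng.uniform(35, 42, 24)                # gestation age in weeks
x2 = np.repeat([1.0, 0.0], 12)              # first 12 male (1), last 12 female (0)
y = -1500 + 115 * x1 + 150 * x2 + rng.normal(0, 120, 24)

# Design matrix for Y = b0 + b1*x1 + b2*x2 + b3*x1*x2 + e, as in (5.103).
X = np.column_stack([np.ones(24), x1, x2, x1 * x2])
b = np.linalg.lstsq(X, y, rcond=None)[0]


def line_for(mask):
    """Intercept and slope of a separate simple regression on one group."""
    Xg = np.column_stack([np.ones(mask.sum()), x1[mask]])
    return np.linalg.lstsq(Xg, y[mask], rcond=None)[0]


female = line_for(x2 == 0)                  # should equal (b0, b1)
male = line_for(x2 == 1)                    # should equal (b0 + b2, b1 + b3)
print(b, female, male)
```

In this parameterization $\beta_2$ and $\beta_3$ are the male-minus-female differences in intercept and slope, so testing them against zero compares the two lines directly.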


There are a number of data sets in the statistics literature which have become well 
known because of various difficulties associated with their analysis. One of these is the 
Hald data [49] given in Example 5.16. Another is the Longley data [76] discussed in 
Example 5.17. 


Example 5.16 (Hald data) These data come from a study of the heat evolved
in the hardening of cement. The variables are:

y = heat evolved in calories/gram of cement,
x1 = percentage of tricalcium aluminate,
x2 = percentage of tricalcium silicate,
x3 = percentage of tetracalcium aluminoferrite,
x4 = percentage of dicalcium phosphate.

As one can easily verify from Table 5.4, $x_1 + x_2 + x_3 + x_4 \approx 100$, indicating that a
possible collinearity exists among the predictor variables. The consequences of this will
be investigated as we proceed.


Table 5.4 Hald's Cement data
Obs. No.   x1   x2   x3   x4     y
   1        7   26    6   60    78.5
   2        1   29   15   52    74.3
   3       11   56    8   20   104.3
   4       11   31    8   47    87.6
   5        7   52    6   33    95.9
   6       11   55    9   22   109.2
   7        3   71   17    6   102.7
   8        1   31   22   44    72.5
   9        2   54   18   22    93.1
  10       21   47    4   26   115.9
  11        1   40   23   34    83.8
  12       11   66    9   12   113.3
  13       10   68    8   12   109.4


In [49] it was suggested that a linear model
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon$ (5.104)

would be appropriate to model the data. Hence (5.104) was fit using least squares giving

$\hat{\beta}_0 = 62.405, \quad \hat{\beta}_1 = 1.5511, \quad \hat{\beta}_2 = 0.5102,$

$\hat{\beta}_3 = 0.1019, \quad \text{and} \quad \hat{\beta}_4 = -0.1441.$
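
As a check on the Hald fit, the sketch below (Python with NumPy) fits (5.104) by least squares. The data array is transcribed from Table 5.4 using the standard published values of the Hald data where the printed table is hard to read, so it should be treated as an assumption and compared against the source before reuse.

```python
import numpy as np

# Hald cement data: columns x1, x2, x3, x4, y (assumed transcription of Table 5.4).
hald = np.array([
    [7, 26, 6, 60, 78.5], [1, 29, 15, 52, 74.3], [11, 56, 8, 20, 104.3],
    [11, 31, 8, 47, 87.6], [7, 52, 6, 33, 95.9], [11, 55, 9, 22, 109.2],
    [3, 71, 17, 6, 102.7], [1, 31, 22, 44, 72.5], [2, 54, 18, 22, 93.1],
    [21, 47, 4, 26, 115.9], [1, 40, 23, 34, 83.8], [11, 66, 9, 12, 113.3],
    [10, 68, 8, 12, 109.4],
])
Xr, y = hald[:, :4], hald[:, 4]
X = np.column_stack([np.ones(len(y)), Xr])      # add the intercept column

beta = np.linalg.lstsq(X, y, rcond=None)[0]     # least squares fit of (5.104)
row_sums = Xr.sum(axis=1)                       # x1 + x2 + x3 + x4 for each row

print(np.round(beta, 4))
print(row_sums)
```

Every row sum lies between 95 and 99, so the four regressors nearly reproduce the column of ones, which is the source of the collinearity noted in the example.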


Example 5.17 (Longley data [76]) Suppose that we wish to determine the derived
employment using regressor variables chosen from the Bureau of Labor Statistics. The
variables are:

y = Total derived employment (in thousands),
x1 = GNP implicit price deflator; 1954 = 100,
x2 = GNP, Gross National Product (in millions of dollars),
x3 = Unemployment (in thousands of persons),
x4 = Size of Armed Forces (in thousands),
x5 = Noninstitutional population 14 years of age and over (in thousands),
x6 = Time in years.


Table 5.5 Longley data
   x1       x2       x3      x4       x5       x6      y
  83.0   234,289   2,356   1,590   107,608   1947   60,323
  88.5   259,426   2,325   1,456   108,632   1948   61,122
  88.2   258,054   3,682   1,616   109,773   1949   60,171
  89.5   284,599   3,351   1,650   110,929   1950   61,187
  96.2   328,975   2,099   3,099   112,075   1951   63,221
  98.1   346,999   1,932   3,594   113,270   1952   63,639
  99.0   365,385   1,870   3,547   115,094   1953   64,989
 100.0   363,112   3,578   3,350   116,219   1954   63,761
 101.2   397,469   2,904   3,048   117,388   1955   66,019
 104.6   419,180   2,822   2,857   118,734   1956   67,857
 108.4   442,769   2,936   2,798   120,445   1957   68,169
 110.8   444,546   4,681   2,637   121,950   1958   66,513
 112.6   482,704   3,813   2,552   123,366   1959   68,655
 114.2   502,601   3,931   2,514   125,868   1960   69,564
 115.7   518,173   4,806   2,572   127,852   1961   69,331
 116.9   554,894   4,007   2,827   130,081   1962   70,551


We suppose that the multiple linear regression model
$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \beta_6 x_6 + \varepsilon$ (5.105)

is appropriate for these data. Hence (5.105) was used to find the least squares
estimates, and the regression coefficients are

$\hat{\beta}_0 = -3482259, \quad \hat{\beta}_1 = 15.1, \quad \hat{\beta}_2 = -0.03582, \quad \hat{\beta}_3 = -2.02,$
$\hat{\beta}_4 = -1.03, \quad \hat{\beta}_5 = -0.051, \quad \text{and} \quad \hat{\beta}_6 = 1829.2.$

As we see, the nature of the problem is economic, involving a price index, gross national
product, unemployment and so on, so the explanatory variables seem to be naturally highly
correlated.

5.3.4 Estimating $\sigma^2$


If the errors are $N(0, \sigma^2)$, then the MLE of $\sigma^2$ can be obtained by differentiating the
likelihood function (5.17) with respect to $\sigma^2$ as for simple linear regression. Carrying
this out (see Exercise 5.3) gives

$\hat{\sigma}^2_{MLE} = \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ (5.106)
where
$\hat{y}_i = (X \hat{\beta})_i, \quad 1 \le i \le n,$ (5.107)

is the estimate of $E(Y_i)$, $1 \le i \le n$. Similarly, the MLE of $\sigma$ is given by

$\hat{\sigma}_{MLE} = \left[ \dfrac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \right]^{1/2}.$ (5.108)

Since $y_i - \hat{y}_i = \hat{e}_i$ is the i-th residual, (5.106) is just the average of the squared
residuals. (As in Chapter 3, the sum of squares of the residuals
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is denoted by SSE.) In general, (5.106) is not used to estimate $\sigma^2$
since it is biased. Rather, the unbiased estimate

$s^2 = \dfrac{SSE}{n - m - 1}$ (5.109)

is the standard estimate of $\sigma^2$ and $s = \sqrt{s^2}$ is the customary estimate of $\sigma$. (Note: $s$ is
generally a biased estimate of $\sigma$, as was shown in Section 3.3.) In the more general case
where the errors are uncorrelated and have constant variance $\sigma^2$, but are not necessarily
normal, $s^2$ in (5.109) and $s$ are also the standard estimates of $\sigma^2$ and $\sigma$ respectively.
Again the justification comes from the unbiasedness of $s^2$, as will be demonstrated in
Section 5.4.

An alternative expression for SSE may be derived from Equation (5.20). Since SSE =
$g(\hat{\beta})$ where $\hat{\beta}$ is the least squares estimate of $\beta$, we have

$g(\hat{\beta}) = (y, y) - (X\hat{\beta}, X\hat{\beta}) = (y, y) - (\hat{\beta}, X^T X \hat{\beta}) = (y, y) - (\hat{\beta}, X^T y).$ (5.110)

Eq. (5.110) shows that SSE can be evaluated without forming the residuals. This
may be computationally more efficient in some circumstances, since $X^T y$ will have
already been obtained in order to solve the normal equations.
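
The two routes to SSE and the two variance estimates can be compared directly. This is a minimal sketch in Python with NumPy, using the data of Example 5.8; the variable names are our own.

```python
import numpy as np

x1 = np.array([-5, -4, -1, 2, 2, 3, 3], dtype=float)
x2 = np.array([5, 4, 1, -3, -2, -2, -3], dtype=float)
y = np.array([11, 11, 8, 2, 5, 5, 4], dtype=float)
X = np.column_stack([np.ones_like(y), x1, x2])
n, p = X.shape                                   # p = m + 1 coefficients

Xty = X.T @ y
beta = np.linalg.solve(X.T @ X, Xty)             # normal equations

resid = y - X @ beta
sse_resid = resid @ resid                        # SSE from the residuals
sse_short = y @ y - beta @ Xty                   # SSE from (5.110), no residuals needed

sigma2_mle = sse_resid / n                       # (5.106), biased
s2 = sse_resid / (n - p)                         # (5.109), unbiased

print(sse_resid, sse_short, sigma2_mle, s2)
```

For these data SSE works out to 12/7, so $s^2 = 3/7$. The shortcut (5.110) subtracts two nearly equal numbers when the fit is good, so in floating point the residual-based form is usually the safer of the two.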


5.4 Properties of $(\hat{\beta}, s^2, \hat{e})$


In this section we will establish a number of useful properties of the least squares
estimator $\hat{\beta}$, the estimate of variance $s^2$ and the residual vector $\hat{e}$. These generalize the
results obtained in Chapter 3 for simple linear regression and are fundamental for the
development of test procedures when the errors are normal. We first consider properties
of $\hat{\beta}$ and $s^2$.




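Theorem 5.4 Let $\hat{\beta}$ be the least squares estimator of $\beta$ in the full rank GLM. If
the errors $\varepsilon_i$, $1 \le i \le n$, are independent with constant variance $\sigma^2$, then

(i) $E(\hat{\beta}) = \beta$ ($\hat{\beta}$ is unbiased);

(ii) $\Sigma(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$;

(iii) $E(s^2) = \sigma^2$ ($s^2$ is unbiased);

(iv) In addition, if $\varepsilon_i$, $1 \le i \le n$, are independent $N(0, \sigma^2)$ random variables, then $\hat{\beta}$
has a multivariate $N(\beta, \sigma^2 (X^T X)^{-1})$ distribution. In particular, each $\hat{\beta}_i$ has a
$N(\beta_i, \sigma^2 \delta_i)$ distribution, where $\delta_i$ is the i-th diagonal element of $(X^T X)^{-1}$.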


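Proof. (i) From (5.22) $\hat{\beta} = (X^T X)^{-1} X^T Y$ so that

$E(\hat{\beta}) = E\left[ (X^T X)^{-1} X^T Y \right] = (X^T X)^{-1} X^T E(Y) = (X^T X)^{-1} X^T X \beta = \beta.$ (5.111)
(ii) From (5.22b), $\hat{\beta} = AY$ where $A = (X^T X)^{-1} X^T$, which shows that $\hat{\beta}$ is a linear
combination of the $Y_i$'s. From Theorem 4.15 the variance-covariance matrix $\Sigma(\hat{\beta})$ of
$\hat{\beta}$ is given by
$\Sigma(\hat{\beta}) = A \Sigma(Y) A^T.$ (5.112)
But $\Sigma(Y) = \sigma^2 I_n$ and $A^T = X (X^T X)^{-1}$ so that

$\Sigma(\hat{\beta}) = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1} = \sigma^2 (X^T X)^{-1}.$ (5.113)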


(iii) The proof that $E(s^2) = \sigma^2$ is a little more complicated than (i). We begin by
obtaining an expression for the residual vector $\hat{e} = Y - X\hat{\beta}$.
Since $\hat{Y} = X\hat{\beta} = X (X^T X)^{-1} X^T Y$,

$\hat{e} = Y - X (X^T X)^{-1} X^T Y = \left[ I_n - X (X^T X)^{-1} X^T \right] Y = (I_n - H) Y$ (5.114a)
where $H = X (X^T X)^{-1} X^T$ is called the hat matrix. Now note that $(I_n - H) X =
X - X (X^T X)^{-1} X^T X = X - X = 0$, so that

$\hat{e} = (I_n - H) Y = (I_n - H)(X\beta + \varepsilon) = (I_n - H) X\beta + (I_n - H)\varepsilon = (I_n - H)\varepsilon.$ (5.114b)

Now observe that $H^T = H$ and

$H^2 = \left[ X (X^T X)^{-1} X^T \right] \left[ X (X^T X)^{-1} X^T \right] = X \left[ (X^T X)^{-1} X^T X (X^T X)^{-1} \right] X^T = X (X^T X)^{-1} X^T = H$ (5.115)
and

$(I_n - H)^2 = I_n - H$ (5.116)
(so that $H$ is a symmetric idempotent matrix). Thus,

$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = (\hat{e}, \hat{e}) = ((I_n - H)\varepsilon, (I_n - H)\varepsilon).$ (5.117)
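But,
$((I_n - H)\varepsilon, (I_n - H)\varepsilon) = (\varepsilon, (I_n - H)(I_n - H)\varepsilon) = (\varepsilon, (I_n - H)\varepsilon) = \sum_{j=1}^{n} \sum_{i=1}^{n} g_{ij} \varepsilon_i \varepsilon_j$ (5.118)

where $g_{ij}$ is the ij-th element of $I_n - H$.
Now taking the expectation of $(\varepsilon, (I_n - H)\varepsilon)$ using the fact that $\varepsilon_i$, $1 \le i \le n$, are
uncorrelated gives

$E\left[ (\varepsilon, (I_n - H)\varepsilon) \right] = \sum_{j=1}^{n} \sum_{i=1}^{n} g_{ij} E(\varepsilon_i \varepsilon_j)$
$\qquad = \sum_{j=1}^{n} \sum_{i=1}^{n} g_{ij} \mathrm{Cov}(\varepsilon_i, \varepsilon_j)$ (because $E(\varepsilon_i) = 0$)
$\qquad = \sum_{i=1}^{n} g_{ii} \mathrm{Var}(\varepsilon_i) = \sigma^2 \sum_{i=1}^{n} g_{ii}.$ (5.119)

We now need to evaluate $\sum_{i=1}^{n} g_{ii}$, which is the trace of $I_n - H$. Using the properties
of the trace given in (4.31)-(4.32),

$\mathrm{tr}(I_n - H) = n - \mathrm{tr}\left[ X (X^T X)^{-1} X^T \right] = n - \mathrm{tr}\left[ (X^T X)^{-1} (X^T X) \right] = n - \mathrm{tr}(I_{m+1}) = n - (m + 1).$ (5.120)

Hence,
$E(s^2) = E\left[ \dfrac{SSE}{n - (m + 1)} \right] = \dfrac{\sigma^2 \left[ n - (m + 1) \right]}{n - (m + 1)} = \sigma^2.$ (5.121)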

(iv) Since each coefficient $\hat{\beta}_i$, $0 \le i \le m$, is a linear combination of the independent
normal random variables $Y_i$, $1 \le i \le n$, it then follows that $\hat{\beta}$ has a joint multivariate
normal distribution. From (5.111) the mean vector is $\beta$ and from (5.112)-(5.113) the
variance-covariance matrix is $\sigma^2 (X^T X)^{-1}$. Thus, $E(\hat{\beta}_i) = \beta_i$ and $\mathrm{Var}(\hat{\beta}_i) = \sigma^2 \delta_i$, $0 \le i \le m$.
Hence $\hat{\beta}_i$ is a $N(\beta_i, \sigma^2 \delta_i)$ random variable. ∎
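
The algebraic facts about the hat matrix used in part (iii) of the proof can be confirmed numerically. The sketch below (Python with NumPy) builds $H$ for the design matrix of Example 5.8 and checks symmetry, idempotence, $(I_n - H)X = 0$, and the trace identity (5.120).

```python
import numpy as np

x1 = np.array([-5, -4, -1, 2, 2, 3, 3], dtype=float)
x2 = np.array([5, 4, 1, -3, -2, -2, -3], dtype=float)
X = np.column_stack([np.ones_like(x1), x1, x2])
n, p = X.shape                                  # p = m + 1

H = X @ np.linalg.solve(X.T @ X, X.T)           # hat matrix X (X^T X)^{-1} X^T
M = np.eye(n) - H                               # residual-forming matrix I_n - H

is_symmetric = np.allclose(H, H.T)
is_idempotent = np.allclose(H @ H, H)           # (5.115)
annihilates_X = np.allclose(M @ X, 0.0)         # (I_n - H) X = 0
trace_M = np.trace(M)                           # should be n - (m + 1), (5.120)

print(is_symmetric, is_idempotent, annihilates_X, trace_M)
```

Here the trace of $I_n - H$ is $7 - 3 = 4$, the divisor that makes $s^2$ unbiased.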


We emphasize that properties (i)-(iii) of Theorem 5.4 are true even if $\varepsilon_i$, $1 \le i \le n$,
are only independent with common variance $\sigma^2$. Normality is not needed. However, if
the errors are normal, further important distributional properties of $\hat{\beta}$ and $s^2$ may be
obtained. The proofs of some of these properties are somewhat technical and students
and instructors wishing to omit them may do so. However, these additional properties
will be used throughout the text and so should be learned even if the proofs are not.


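Theorem 5.5 Let $Y = X\beta + \varepsilon$ be a full rank linear model where $\varepsilon_i$, $1 \le i \le n$, are
independent $N(0, \sigma^2)$. Then,

(i) $\hat{\beta}_i$, $0 \le i \le m$, and $s^2$ are independent random variables;

(ii) $(n - m - 1) s^2 / \sigma^2$ has a chi-square distribution with $n - m - 1$ degrees of freedom.

(iii) If $\delta_i$ is the i-th diagonal element of $(X^T X)^{-1}$, then

$T_i = \dfrac{\hat{\beta}_i - \beta_i}{s \sqrt{\delta_i}}$ (5.122)

has a t distribution with $n - m - 1$ degrees of freedom.
(Note: Since $\mathrm{Var}(\hat{\beta}_i) = \sigma^2 \delta_i$, $s\sqrt{\delta_i}$ is an estimate of $\sigma(\hat{\beta}_i)$. $s\sqrt{\delta_i}$ is usually
called the standard error of $\hat{\beta}_i$. It will be denoted by $\hat{\sigma}(\hat{\beta}_i)$.)

(iv) Let $C$ be an $r \times (m + 1)$ matrix of rank $r$. Then the quadratic form

$\left( C(\hat{\beta} - \beta), \left[ C (X^T X)^{-1} C^T \right]^{-1} C(\hat{\beta} - \beta) \right) / \sigma^2$ (5.123)

has a chi-square distribution with $r$ degrees of freedom.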


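Proof. (i) Consider

$\hat{\beta} = (X^T X)^{-1} X^T Y = (X^T X)^{-1} X^T (X\beta + \varepsilon) = \beta + (X^T X)^{-1} X^T \varepsilon$ (5.124)
so that
$\hat{\beta} - \beta = (X^T X)^{-1} X^T \varepsilon.$ (5.125)

From (5.114b) $\hat{e} = (I_n - H)\varepsilon$, so the vector $((\hat{\beta} - \beta)^T, \hat{e}^T)^T$ can be written as

$Z = \begin{pmatrix} \hat{\beta} - \beta \\ \hat{e} \end{pmatrix} = \begin{pmatrix} (X^T X)^{-1} X^T \\ I_n - H \end{pmatrix} \varepsilon = \begin{pmatrix} A \\ B \end{pmatrix} \varepsilon$ (5.126)

where $Z$ is a linear function only of $\varepsilon_i$, $1 \le i \le n$. Hence it follows that $Z$ has a degenerate
multivariate normal distribution. The independence follows from Theorem 4.15 if we can
show that $\mathrm{Cov}(\hat{\beta}_i, \hat{e}_j) = 0$, $0 \le i \le m$, $1 \le j \le n$. To do this we calculate $\Sigma(Z)$. Using
(5.126) we find that

$\Sigma(Z) = \begin{pmatrix} A \\ B \end{pmatrix} \Sigma(\varepsilon) \begin{pmatrix} A^T & B^T \end{pmatrix} = \sigma^2 \begin{pmatrix} A A^T & A B^T \\ B A^T & B B^T \end{pmatrix}.$ (5.127)
From the definition of $Z$, the matrix $\sigma^2 A B^T = [\mathrm{Cov}(\hat{\beta}_i, \hat{e}_j)]$, $0 \le i \le m$, $1 \le j \le n$, so
that

$A B^T = (X^T X)^{-1} X^T \left[ I_n - X (X^T X)^{-1} X^T \right]$
$\qquad = (X^T X)^{-1} X^T - (X^T X)^{-1} X^T X (X^T X)^{-1} X^T$
$\qquad = (X^T X)^{-1} X^T - (X^T X)^{-1} X^T = 0.$ (5.128)

Thus $\hat{\beta}$ and $\hat{e}$ are independent, and since $s^2$ is a function of $\hat{e}$ alone, $\hat{\beta}$ and $s^2$ are independent.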


(ii) For this we observe that $(n - m - 1) s^2 / \sigma^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 / \sigma^2 = (\varepsilon, (I_n - H)\varepsilon) / \sigma^2$,
which follows from (5.117). Since $I_n - H$ is symmetric and idempotent, it follows
from the spectral theorem (Theorem 4.11) that

$I_n - H = Q \Lambda Q^T$ (5.129)

where $Q$ is an orthogonal matrix whose columns are the eigenvectors of $I_n - H$ and $\Lambda$ is
an $n \times n$ diagonal matrix whose diagonal elements are the eigenvalues of $I_n - H$. Since
the eigenvalues of $I_n - H$ are either 0 or 1, and $\mathrm{tr}(I_n - H) = \mathrm{rank}(I_n - H) = n - m - 1 =
\mathrm{tr}(\Lambda)$, $\Lambda$ can be written in the form

$\Lambda = \begin{pmatrix} I_{n-m-1} & 0 \\ 0 & 0 \end{pmatrix}.$ (5.130)
Hence,
$(\varepsilon, (I_n - H)\varepsilon) = (\varepsilon, Q \Lambda Q^T \varepsilon) = (Q^T \varepsilon, \Lambda Q^T \varepsilon).$ (5.131)

Since $Q^T$ is an orthogonal matrix, letting $q = Q^T \varepsilon$ it follows from Theorem 4.15 that $q$
is a vector of independent $N(0, \sigma^2)$ random variables. Thus,

$\dfrac{(q, \Lambda q)}{\sigma^2} = \sum_{i=1}^{n-m-1} \left( \dfrac{q_i}{\sigma} \right)^2$ (5.132)

is the sum of the squares of $n - m - 1$ independent $N(0, 1)$ random variables and so is
$\chi^2$ with $n - m - 1$ degrees of freedom.


(iii) From (iv) of Theorem 5.4 we know that $\hat{\beta}_i - \beta_i$ is $N(0, \sigma^2 \delta_i)$ so that $(\hat{\beta}_i - \beta_i) / (\sigma \sqrt{\delta_i})$
is $N(0, 1)$. Now

$T_i = \dfrac{\hat{\beta}_i - \beta_i}{s \sqrt{\delta_i}} = \dfrac{(\hat{\beta}_i - \beta_i) / (\sigma \sqrt{\delta_i})}{s / \sigma}$ (5.133)

and from (ii) $s/\sigma$ is the square root of a $\chi^2(n - m - 1)$ random variable divided by its
degrees of freedom. From (i) $(\hat{\beta}_i - \beta_i) / (\sigma \sqrt{\delta_i})$ and $s/\sigma$ are independent, so it follows
from Section 2.8.3 that $T_i$ has a t-distribution with $n - m - 1$ degrees of freedom.
(iv) To prove (5.123) let $W = C\hat{\beta}$ and note that $W$ is $N(C\beta, \sigma^2 C (X^T X)^{-1} C^T)$.
Thus, $C\hat{\beta} - C\beta$ is $N(0, \sigma^2 C (X^T X)^{-1} C^T)$, so we need only establish the following
fact. Let $z$ be an $r$-dimensional $N(0, \sigma^2 \Sigma)$ random vector; then

$q = (z, \Sigma^{-1} z) / \sigma^2$ (5.134)
is $\chi^2(r)$.

To see this, we again use the spectral theorem to write
$\Sigma = R R^T$ (5.135)
where $R$ is nonsingular. Let $u = R^{-1} z$; then $E(u) = R^{-1} E(z) = 0$ and
$\Sigma(u) = R^{-1} \Sigma(z) (R^{-1})^T = \sigma^2 R^{-1} R R^T (R^T)^{-1} = \sigma^2 I_r.$ (5.136)

Thus, the components of $u$ are independent $N(0, \sigma^2)$ random variables and

$q = \dfrac{(z, \Sigma^{-1} z)}{\sigma^2} = \dfrac{(Ru, (R^T)^{-1} R^{-1} R u)}{\sigma^2} = \dfrac{(R^{-1} R u, R^{-1} R u)}{\sigma^2} = \dfrac{(u, u)}{\sigma^2} = \sum_{i=1}^{r} \left( \dfrac{u_i}{\sigma} \right)^2.$ (5.137)

This shows that $q$ is the sum of the squares of $r$ independent $N(0, 1)$ random variables
and so is $\chi^2(r)$. ∎
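
Parts (i) and (ii) of Theorem 5.5 can be illustrated by simulation. The following Monte Carlo sketch (Python with NumPy) uses an arbitrary design matrix and parameter values chosen only for illustration. It draws many error vectors, computes $\hat{\beta}$ and $s^2$ for each through (5.124) and (5.114b), and compares the sample mean and variance of $(n - m - 1)s^2/\sigma^2$ with those of a $\chi^2(n - m - 1)$ variable, namely $n - m - 1$ and $2(n - m - 1)$.

```python
import numpy as np

rng = np.random.default_rng(42)
n, m, sigma, reps = 12, 2, 2.0, 20000            # illustrative choices
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
beta = np.array([1.0, -2.0, 0.5])                # arbitrary "true" coefficients

A = np.linalg.solve(X.T @ X, X.T)                # (X^T X)^{-1} X^T
M = np.eye(n) - X @ A                            # I_n - H (symmetric)

E = sigma * rng.normal(size=(reps, n))           # each row is one error vector
beta_hat = beta + E @ A.T                        # (5.124), one estimate per row
resid = E @ M                                    # rows are ((I_n - H) e)^T, (5.114b)
s2 = (resid**2).sum(axis=1) / (n - m - 1)

stat = (n - m - 1) * s2 / sigma**2               # should behave like chi-square(9)
mean_stat, var_stat = stat.mean(), stat.var()
corr = np.corrcoef(beta_hat[:, 1], s2)[0, 1]     # near 0 if independent

print(mean_stat, var_stat, corr)
```

A near-zero sample correlation is consistent with independence but does not prove it; the proof above is what establishes the result.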
5.4.1 Properties of $\hat{e}$


Here we summarize a number of basic properties of the residual vector $\hat{e}$. Some of these
have already been discussed in Chapter 3 for the simple linear regression model and we
will give suitable generalizations here.


Theorem 5.6 Let $\hat{e} = Y - \hat{Y}$ denote the residual vector from the least squares
estimation of $\beta$ in the GLM where the errors are uncorrelated with common variance
$\sigma^2$. Then,

(i) $E(\hat{e}) = 0$;

(ii) $\Sigma(\hat{e}) = \sigma^2 (I_n - H)$;

(iii) If $\varepsilon$ is $N(0, \sigma^2 I_n)$ then $\hat{e}$ has a $N(0, \sigma^2 (I_n - H))$ distribution.

If the model has an intercept (that is, it includes the term $\beta_0$), then,

(iv) $\sum_{i=1}^{n} \hat{e}_i = 0$, and

(v) $\sum_{i=1}^{n} \hat{e}_i \hat{y}_i = 0$.

Proof. (i), (ii) and (iii) follow easily from the representation $\hat{e} = (I_n - H)\varepsilon$
given by (5.114b) and are left to the reader.
(iv) From Exercise 5.7, or by examining the first of the normal equations, we have

$\sum_{i=1}^{n} \left[ y_i - \left( \hat{\beta}_0 + \sum_{j=1}^{m} x_{ij} \hat{\beta}_j \right) \right] = 0,$ (5.138)

where the term in the parentheses is $\hat{y}_i$. Thus, $\sum_{i=1}^{n} (y_i - \hat{y}_i) = 0$ as required.
(v) For this we observe that the normal equations can be written as

$X^T (y - X\hat{\beta}) = 0,$ (5.139)
and taking the dot product of (5.139) with $\hat{\beta}$ gives
$(\hat{\beta}, X^T (y - X\hat{\beta})) = (X\hat{\beta}, y - X\hat{\beta}) = (\hat{y}, \hat{e}) = 0.$ (5.140)

But $(\hat{y}, \hat{e}) = \sum_{i=1}^{n} \hat{e}_i \hat{y}_i$ and (v) follows. (Note that (v) does not require the model to
have an intercept.) ∎


Using (iv) and (v) of Theorem 5.6 we can obtain the following decomposition of the
adjusted sum of squares for a model with an intercept:

$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2 + \sum_{i=1}^{n} (y_i - \hat{y}_i)^2,$ that is, SST = SSR + SSE, (5.141)

where, as in Chapter 3, SSR is the regression sum of squares. The proof of (5.141),
as in Eq. (3.141), requires only that $\sum_{i=1}^{n} \hat{e}_i = 0$ and $\sum_{i=1}^{n} \hat{e}_i \hat{y}_i = 0$ and so needs no
further justification. As for simple linear regression, we will use this identity to introduce
a generalization of $R^2$ in Section 5.6.
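
Properties (iv) and (v) of Theorem 5.6 and the decomposition (5.141) are simple to verify on a small data set. The sketch below (Python with NumPy) uses the data of Example 5.8; the variable names are our own.

```python
import numpy as np

x1 = np.array([-5, -4, -1, 2, 2, 3, 3], dtype=float)
x2 = np.array([5, 4, 1, -3, -2, -2, -3], dtype=float)
y = np.array([11, 11, 8, 2, 5, 5, 4], dtype=float)
X = np.column_stack([np.ones_like(y), x1, x2])     # model with an intercept

beta = np.linalg.lstsq(X, y, rcond=None)[0]
y_hat = X @ beta
e = y - y_hat                                      # residual vector

sum_e = e.sum()                                    # property (iv): should be 0
sum_e_yhat = e @ y_hat                             # property (v): should be 0

sst = ((y - y.mean())**2).sum()                    # total (adjusted) sum of squares
ssr = ((y_hat - y.mean())**2).sum()                # regression sum of squares
sse = e @ e                                        # residual sum of squares

print(sum_e, sum_e_yhat, sst, ssr, sse)
```

For these data SSR = 72 and SSE = 12/7, so almost all of the variation in $y$ about its mean is accounted for by the regression.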


5.4.2 Further properties of $\Sigma(\hat{\beta})$


Because the variance-covariance matrix $\Sigma(\hat{\beta})$ of $\hat{\beta}$ plays such an important role in
regression analysis we will now present several examples of its computation. For most
numerical examples, this needs to be done with a computer. However, our emphasis
here will be on the development of a number of alternative expressions for $\Sigma(\hat{\beta})$ arising
primarily from various special forms of the regression model.


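Example 5.18 ($\Sigma(\hat{\beta})$ for the simple linear regression model) To illustrate some
of the power of matrix methods in doing regression calculations we will rederive the
expressions for the variances of $\hat{\beta}_0$ and $\hat{\beta}_1$ in the simple linear regression model.
We note from Example 5.8 that in this case

$X^T X = \begin{pmatrix} n & \sum_{i=1}^{n} x_i \\ \sum_{i=1}^{n} x_i & \sum_{i=1}^{n} x_i^2 \end{pmatrix}$ (5.142)

and using Cramer's rule to find the inverse of a $2 \times 2$ matrix we find that

$(X^T X)^{-1} = \dfrac{1}{\det(X^T X)} \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -\sum_{i=1}^{n} x_i \\ -\sum_{i=1}^{n} x_i & n \end{pmatrix}$ (5.143)
where
$\det(X^T X) = n \sum_{i=1}^{n} x_i^2 - \left( \sum_{i=1}^{n} x_i \right)^2 = n \left[ \sum_{i=1}^{n} x_i^2 - n \bar{x}^2 \right] = n \sum_{i=1}^{n} (x_i - \bar{x})^2 = n S_{xx}.$ (5.144)
Thus,

$(X^T X)^{-1} = \dfrac{1}{n S_{xx}} \begin{pmatrix} \sum_{i=1}^{n} x_i^2 & -\sum_{i=1}^{n} x_i \\ -\sum_{i=1}^{n} x_i & n \end{pmatrix}.$ (5.145)

From Theorem 5.4 we find that $\sigma^2(\hat{\beta}_i)$, $i = 0, 1$, are given by $\sigma^2 \delta_i$, $i = 0, 1$, where
$\delta_i$, $i = 0, 1$, are the i-th diagonal elements of $(X^T X)^{-1}$. Thus,

$\sigma^2(\hat{\beta}_0) = \dfrac{\sigma^2 \sum_{i=1}^{n} x_i^2}{n S_{xx}}$ (5.146)

and
$\sigma^2(\hat{\beta}_1) = \dfrac{\sigma^2}{S_{xx}}.$ (5.147)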

From (5.147) we see that the expression for $\sigma^2(\hat{\beta}_1)$ agrees with that given in Theorem
3.1 while that for $\sigma^2(\hat{\beta}_0)$ appears somewhat different. A little algebra, however, shows
that they agree. For this we observe that in Theorem 3.1 we found that

$\sigma^2(\hat{\beta}_0) = \sigma^2 \left[ \dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}} \right].$ (5.148)
But,
$\dfrac{\sum_{i=1}^{n} x_i^2}{n S_{xx}} = \dfrac{S_{xx} + n \bar{x}^2}{n S_{xx}} = \dfrac{1}{n} + \dfrac{\bar{x}^2}{S_{xx}}$ (5.149)

as follows from (5.144), so the two expressions for $\sigma^2(\hat{\beta}_0)$ agree.

In addition, we note from (5.143) that $\mathrm{Cov}(\hat{\beta}_0, \hat{\beta}_1) = -\sigma^2 \sum_{i=1}^{n} x_i / (n S_{xx}) = -\sigma^2 \bar{x} / S_{xx}$,
which agrees with the expression found in Chapter 3 (and by a somewhat less tedious
calculation).

Example 5.19 (Spectral analysis of $\sigma^2(\hat{\beta}_i)$) For some purposes, particularly for
the analysis of multicollinearity, it is useful to see how the variances of $\hat{\beta}_i$ depend on the
eigenvalues of $X^T X$.
For this we use the spectral theorem, so that $X^T X = Q \Lambda Q^T$ and $(X^T X)^{-1} =
Q \Lambda^{-1} Q^T$. Now letting $q_j = [q_{ij}]$ denote the j-th column of $Q$, then

$(Q \Lambda^{-1})_{ij} = \dfrac{q_{ij}}{\lambda_j}, \quad 0 \le i, j \le m,$ (5.150)

and $(Q^T)_{ij} = q_{ji}$, $0 \le i, j \le m$. Thus the i-th diagonal element of $Q \Lambda^{-1} Q^T$ is given by

$\sum_{k=0}^{m} \left( \dfrac{q_{ik}}{\lambda_k} \right) (Q^T)_{ki} = \sum_{k=0}^{m} \left( \dfrac{q_{ik}}{\lambda_k} \right) q_{ik}$ (5.151)
so that
$(Q \Lambda^{-1} Q^T)_{ii} = \sum_{k=0}^{m} \dfrac{q_{ik}^2}{\lambda_k},$ (5.152)
so that
$\delta_i = \left[ (X^T X)^{-1} \right]_{ii} = \sum_{k=0}^{m} \dfrac{q_{ik}^2}{\lambda_k},$ (5.153)
$\sigma^2(\hat{\beta}_i) = \sigma^2 \sum_{k=0}^{m} \dfrac{q_{ik}^2}{\lambda_k}.$ (5.154)

From (5.153) we see that if any of the eigenvalues of $X^T X$ is very small, then the
variances of all the $\hat{\beta}_i$'s may be large and so the coefficients are estimated with poor precision. Since,
as we will show in Chapter 9, it is the existence of small eigenvalues of $X^T X$ that
characterizes multicollinearity, large values of $\delta_i$ are suggestive of multicollinearity.


In Chapter 9 we will argue that when the regression is done in correlation form,
values of $\delta_i > 10$ (in this case $\delta_i$ is called the variance inflation factor (VIF)) are
suggestive of possible problems due to multicollinearity. Since most computer
packages give the VIFs as standard output, for now we will monitor collinearity by
examining these quantities. In Table 5.6 we list the VIFs for the data discussed in
Examples 5.12-5.17.


Table 5.6 Variance Inflation Factors (VIFs) for Examples 5.12-5.17

Data in Examples                          x_1      x_2      x_3      x_4      x_5      x_6
Housing prices data (Ex. 5.12)
Drink delivery data (Ex. 5.14)            3.1      3.1
Birth weight data (Ex. 5.15) x_3 in       2.1    477.6    473.3
Birth weight data (Ex. 5.15) x_3 out      1.0      1.0
Hald data (Ex. 5.16)                     38.5    254.4     46.9    282.5
Longley data (Ex. 5.17)                 135.5   1788.5     33.6      3.6    399.2    759.0


From Table 5.6 we see that both the Longley and Hald data have a number of large
VIFs so we anticipate difficulty in interpreting the models proposed in Examples 5.16
and 5.17. The housing and drink delivery data appear to be free of multicollinearity
problems while the birth weight data appear contradictory. (Can you suggest what the
problem is?) Further discussion of this matter will be given in Chapter 7.

As we shall see, the presence of multicollinearity can confound the statistical
interpretation of the estimated coefficients $\hat\beta_i$. As we proceed with our discussion of estimation
and testing the reader should keep this in mind when confronting apparently
contradictory results of a particular analysis.

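Another useful result, derivable from (5.113), is an expression for the total variance
of $\hat\beta$, $\sum_{i=0}^m \sigma^2(\hat\beta_i) = \mathrm{Var}(\hat\beta)$. Since $\sigma^2(\hat\beta_i)$ is $\sigma^2\delta_i$, then

    $$ \mathrm{Var}(\hat\beta) = \sigma^2 \sum_{i=0}^m \delta_i. $$    (5.155)

But $\sum_{i=0}^m \delta_i = \mathrm{tr}[(X^T X)^{-1}]$. From (4.141), $\mathrm{tr}[(X^T X)^{-1}] = \mathrm{tr}(\Lambda^{-1}) = \sum_{i=0}^m 1/\lambda_i$, so
that

    $$ \mathrm{Var}(\hat\beta) = \sigma^2 \sum_{i=0}^m \frac{1}{\lambda_i}. $$    (5.156)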
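
To make (5.153) and (5.156) concrete, the following Python sketch treats the simplest
correlation-form case, two standardized predictors with sample correlation r, for which
the matrix is [[1, r], [r, 1]] with eigenvalues 1 + r and 1 - r. The function names and
the two values of r are assumptions chosen for illustration; the indices here run over
the two predictors only, since the intercept is absent in correlation form.

```python
import math


def deltas_from_spectrum(eigenvalues, q):
    """delta_i = sum_k q[i][k]**2 / lambda_k, as in (5.153).
    The eigenvectors are stored as the columns of q."""
    size = len(eigenvalues)
    return [sum(q[i][k] ** 2 / eigenvalues[k] for k in range(size))
            for i in range(size)]


def two_predictor_example(r):
    """Spectral data for [[1, r], [r, 1]]: eigenvalues 1 + r and 1 - r with
    orthonormal eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2)."""
    c = 1.0 / math.sqrt(2.0)
    eigenvalues = [1.0 + r, 1.0 - r]
    q = [[c, c], [c, -c]]
    return eigenvalues, deltas_from_spectrum(eigenvalues, q)


eigs_mild, deltas_mild = two_predictor_example(0.5)
eigs_severe, deltas_severe = two_predictor_example(0.95)

# Total variance factor from (5.156): the sum of reciprocal eigenvalues
total_severe = sum(1.0 / lam for lam in eigs_severe)
```

In this special case each $\delta_i$ reduces to $1/(1 - r^2)$, so r = 0.95 already pushes the
VIF above the threshold of 10, and the small eigenvalue 1 - r dominates the total.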


Example 5.20 ($\Sigma(\hat\beta)$ for orthogonal models) When the model is orthogonal $\Sigma(\hat\beta)$
is particularly easy to obtain. In this case, since the columns of X are orthogonal,

    $$ X^T X = \mathrm{diag}[(x_i, x_i)] $$    (5.157)

where $x_i$ is the i-th column of X. Thus,

    $$ \Sigma(\hat\beta) = \sigma^2\, \mathrm{diag}\left[\frac{1}{(x_i, x_i)}\right] $$    (5.158)

and from this we find that the $\hat\beta_i$ are uncorrelated and so in the normal model are independent
and

    $$ \sigma^2(\hat\beta_i) = \frac{\sigma^2}{(x_i, x_i)}, \quad 0 \le i \le m, $$    (5.159)

which can be obtained without any matrix inversion.


Example 5.21 (Computing $\Sigma(\hat\beta)$ from the centered and scaled models) Because
one often uses the centered and/or the centered and scaled forms of X to do calculations,
it is important to see how one can obtain $\Sigma(\hat\beta)$ in terms of the quantities $X_c$ and $X_{cs}$.
We begin with the centered model first by noting that

    $$ (X^T X)^{-1} = \begin{bmatrix} \gamma & b \\ b^T & D \end{bmatrix} $$    (5.160)

where

    $$ \gamma = (-b a^T + 1)/n, \quad b = -a (X_c^T X_c)^{-1}/n, \quad D = (X_c^T X_c)^{-1}, \quad a = 1^T X_1, $$

where $1 = (1,1,\ldots,1)^T$ is an n-vector, and $X_1$ is the matrix of columns 1 to m of X.
Thus, $\sigma^2(\hat\beta_i)$, $1 \le i \le m$, is given by

    $$ \sigma^2(\hat\beta_i) = \sigma^2 [(X_c^T X_c)^{-1}]_{ii}, $$    (5.161)

and

    $$ \sigma^2(\hat\beta_0) = \sigma^2\gamma. $$    (5.162)

We note that these can be computed in terms of quantities determined in the course of
solving the centered normal equations.

In the case of centered and scaled variables we have from Example 5.9 that $X_c =
X_{cs}S^{-1}$, so that $X_c^T X_c = S^{-1}X_{cs}^T X_{cs}S^{-1}$ and hence $(X_c^T X_c)^{-1} = S(X_{cs}^T X_{cs})^{-1}S$. Then
$\sigma^2(\hat\beta_i)$, $0 \le i \le m$, can be obtained from (5.161)-(5.162).
5.4.3 A Summary of OLS Estimators


Because the properties of $\hat\beta$, $\hat\varepsilon$ and $s^2$ that we have proved in the previous subsection are
so important for the further development of regression analysis we feel it is worthwhile
to gather them here in Table 5.7 for easy future reference.

In particular, those readers who have skipped the proof of Theorem 5.4 may simply
refer to this section for the needed results as they arise in further work. We assume that
we are considering the full rank GLM in the form (5.16) where the errors are uncorrelated
with common variance $\sigma^2$. For distributional properties we make the additional
assumption that the errors $\varepsilon_i$ are independent $N(0,\sigma^2)$.


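Table 5.7 Summary of Properties of Ordinary Least Squares Estimators


Estimator                    Properties

$\hat\beta$                  $\hat\beta = (X^T X)^{-1}X^T Y$
(least squares               $\Sigma(\hat\beta) = \sigma^2 (X^T X)^{-1}$, $\mathrm{Var}(\hat\beta_i) = \sigma^2\delta_i$,
estimator of $\beta$)        where $\delta_i$ is the i-th diagonal element of $(X^T X)^{-1}$

$\hat{Y}$                    $\hat{Y} = X\hat\beta$
(point estimator             $\Sigma(\hat{Y}) = \sigma^2 H$, $H = X(X^T X)^{-1}X^T$, $\mathrm{Var}(\hat{Y}_i) = \sigma^2 h_{ii}$,
of E(Y))                     where $h_{ii}$ is the i-th diagonal element of H

$\hat\varepsilon$            $\hat\varepsilon = Y - \hat{Y}$
(residual vector)

$s^2$                        $(n - m - 1)s^2/\sigma^2$ is $\chi^2(n - m - 1)$
(estimator of $\sigma^2$)    $s^2$ is independent of $\hat\beta_i$, $i = 0,1,\ldots,m$
                             $s(\hat\beta_i) = s\sqrt{\delta_i}$ is the standard error of $\hat\beta_i$
                             $(\hat\beta_i - \beta_i)/(s\sqrt{\delta_i})$ has a t-distribution with df = n - m - 1

H                            $H^T = H$, $(I_n - H)^T = I_n - H$ (symmetry)
(the hat matrix)             $H^2 = H$, $(I_n - H)^2 = I_n - H$ (idempotency)
                             $HX = X$
                             $\mathrm{tr}(H) = \sum_{i=1}^n h_{ii} = m + 1$, $1/n \le h_{ii} \le 1$
                             $H(I_n - H) = (I_n - H)H = 0$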
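
The hat matrix properties listed in Table 5.7 are easy to confirm numerically. The Python
sketch below builds H directly from its definition for a simple linear regression design
(so m = 1); the data values and the helper names `hat_matrix` and `matmul` are assumptions
introduced only for this illustration.

```python
def hat_matrix(x):
    """Hat matrix H = X (X^T X)^{-1} X^T for the design X = [1 | x] (m = 1)."""
    n = len(x)
    sum_x = sum(x)
    sum_x2 = sum(v * v for v in x)
    det = n * sum_x2 - sum_x ** 2
    inv = [[sum_x2 / det, -sum_x / det],
           [-sum_x / det, n / det]]              # (X^T X)^{-1}
    rows = [[1.0, v] for v in x]                 # rows of X
    return [[sum(ri[a] * inv[a][b] * rj[b] for a in range(2) for b in range(2))
             for rj in rows] for ri in rows]


def matmul(a, b):
    """Plain-Python product of two square matrices stored as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


x = [1.0, 2.0, 4.0, 7.0]                         # illustrative values only
n = len(x)
H = hat_matrix(x)

trace_H = sum(H[i][i] for i in range(n))         # should equal m + 1 = 2
H_squared = matmul(H, H)                         # should reproduce H
row_sums = [sum(row) for row in H]               # H 1, should be all ones
```

Because the first column of X is the vector of ones, the row sums of H equal one, which
is the fact H1 = 1 used repeatedly in Section 5.6.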
5.5 The Gauss-Markov Theorem


As we have pointed out on several occasions, the least squares estimator of $\beta$ in the
GLM coincides with the MLE when the errors are independent and $N(0,\sigma^2)$. Since the
least squares estimator is known to be efficient in this case, that is, it is the minimum
variance unbiased estimator (MVUE) of $\beta$, it is interesting to determine what optimality
properties are present if the errors are not normal.

As for simple linear regression, the least squares estimator of $\beta$ is the minimum
variance unbiased linear estimator provided only that the errors are uncorrelated and have
constant variance. This is the celebrated Gauss-Markov theorem and it is frequently invoked to
justify the use of the least squares estimator even when the errors are not normal. Even
though the least squares estimator is the “best linear unbiased estimator” (BLUE) there
may be better linear biased estimators or even better nonlinear estimators. Some of these
are discussed in [66, 87]. Nevertheless, these other estimators are not as widely used
as the least squares estimator and their properties are not as well developed.


Definition 5.1 Let $\beta$ be the vector of regression coefficients in the GLM. We say
that $\tilde\beta$ is a linear estimator of $\beta$ if each $\tilde\beta_i$ is a linear combination of the observations
$Y_j$, $1 \le j \le n$, so that

    $$ \tilde\beta_i = \sum_{j=1}^n a_{ij}Y_j, \quad 0 \le i \le m. $$    (5.163)

If $A = [a_{ij}]$ then (5.163) can be written as

    $$ \tilde\beta = AY. $$    (5.164)

Observe that if the model (5.16) has full rank, then the least squares estimator is linear,
since

    $$ \hat\beta = (X^T X)^{-1}X^T Y $$    (5.165)

and $\hat\beta = AY$, where $A = (X^T X)^{-1}X^T$.


Theorem 5.7 (Gauss-Markov theorem) Consider the full rank GLM $Y = X\beta + \varepsilon$,
where $\varepsilon_i$, $1 \le i \le n$, are uncorrelated and have common variance $\sigma^2$. Then the least
squares estimator $\hat\beta$ of $\beta$ is the minimum variance unbiased linear estimator of $\beta$. ($\hat\beta$
is often referred to as the BLUE estimator.)


Proof. Let $\tilde\beta = AY$ be a linear estimator of $\beta$. In order that $\tilde\beta$ be unbiased
we must have $E(\tilde\beta) = \beta$ and this gives

    $$ E(\tilde\beta) = AE(Y) = AX\beta = \beta. $$    (5.166)

Thus, $(AX - I_{m+1})\beta = 0$. Since this must hold for all vectors $\beta \in R^{m+1}$, $AX - I_{m+1} = 0$,
or equivalently,

    $$ AX = I_{m+1}. $$    (5.167)

We now compute the variance-covariance matrix of $\tilde\beta$. From (5.112)

    $$ \Sigma(\tilde\beta) = A\Sigma(Y)A^T $$    (5.168)

and by assumption $\Sigma(Y) = \sigma^2 I_n$, so that

    $$ \Sigma(\tilde\beta) = \sigma^2 AA^T. $$    (5.169)

Now write

    $$ A = (X^T X)^{-1}X^T + D $$    (5.170)

so that D represents the difference between A and the matrix for the least squares
estimator. We will have proved the theorem if we can show that the variances are
minimized when D = 0.
Substituting this expression for A into (5.169) we get

    $$ \Sigma(\tilde\beta) = \sigma^2 [(X^T X)^{-1}X^T + D][X(X^T X)^{-1} + D^T] $$
    $$ \qquad = \sigma^2 [(X^T X)^{-1} + DX(X^T X)^{-1} + (X^T X)^{-1}X^T D^T + DD^T]. $$    (5.171)

Similarly, using the unbiasedness condition,

    $$ [(X^T X)^{-1}X^T + D]X = I_{m+1}, $$    (5.172)

so that $DX = 0$, and taking the transpose of this, $X^T D^T = 0$. Using these two conditions
in (5.171) shows that

    $$ \Sigma(\tilde\beta) = \sigma^2 [(X^T X)^{-1} + DD^T]. $$    (5.173)

Now examining the diagonal elements on both sides of (5.173) gives

    $$ \mathrm{Var}(\tilde\beta_i) = \sigma^2 \Big[\delta_i + \sum_{j=1}^n d_{ij}^2\Big]. $$    (5.174)

Since $\delta_i > 0$ and $\sum_{j=1}^n d_{ij}^2 \ge 0$, the variance of $\tilde\beta_i$ is minimized by setting $\sum_{j=1}^n d_{ij}^2 = 0$.
Thus $d_{ij} = 0$, $1 \le j \le n$, and since this holds for all $i = 0,1,\ldots,m$, we find that
$D = 0$. Thus the linear unbiased estimator $\tilde\beta$ which minimizes $\mathrm{Var}(\tilde\beta_i)$, $0 \le i \le m$, is
$\tilde\beta = (X^T X)^{-1}X^T Y$ which is the least squares estimator. ■
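
The decomposition (5.174) can be seen at work in the simplest special case, the
intercept-only model (m = 0, X = 1), where the least squares estimator is the sample mean
and $\delta_0 = 1/n$. The Python sketch below compares it with another unbiased linear
estimator; the alternative weights and the helper name `variance_factor` are assumptions
chosen only to illustrate the theorem, not part of the proof.

```python
def variance_factor(weights):
    """Var(sum_j a_j Y_j) / sigma^2 for uncorrelated Y_j with common variance."""
    return sum(a * a for a in weights)


n = 5
ols_weights = [1.0 / n] * n                  # sample mean: the row of (X^T X)^{-1} X^T
alt_weights = [0.4, 0.3, 0.1, 0.1, 0.1]      # also unbiased, since the weights sum to 1

# Row of the matrix D in (5.170); the condition DX = 0 means its entries sum to zero
d_row = [a - w for a, w in zip(alt_weights, ols_weights)]

var_ols = variance_factor(ols_weights)       # delta_0 = 1/n
var_alt = variance_factor(alt_weights)       # delta_0 + sum_j d_j^2, as in (5.174)
excess = variance_factor(d_row)              # the extra variance contributed by D
```

Any choice of weights other than 1/n gives a strictly positive excess, so the sample mean
has the smallest variance among unbiased linear estimators of the mean.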


5.6 Testing the Fit - the Basic ANOVA Table 
5.6.1 The Overall F-Test 


As in the case of simple linear regression, we must regard any model as tentative, and
having proposed the model, it becomes necessary to determine if the fitted model is
statistically significant. In this case we would like to know whether at least one of the
coefficients $\beta_i$, $1 \le i \le m$, is nonzero which implies that at least one of the variables
$x_i$, $1 \le i \le m$, is useful in explaining the observed variation of the observations $y_i$,
$1 \le i \le n$.

One way of doing this is to test if each individual coefficient $\beta_i$, $1 \le i \le m$, differs
from zero, say through some generalization of the t-test used for simple linear regression.
However, if the number of variables is large, the probability of accepting at least one $\beta_i$
as nonzero is large, even if the significance level of each of the tests is small. Thus, the
overall Type I error of this procedure may be unacceptably large unless the significance
level of each individual test is very small.

Then we have the risk of accepting coefficients which are nonzero as zero, and thus 
possibly omitting significant variables from the model. 
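
The size of this effect is easy to quantify under a simplifying assumption. If the m
individual tests were independent (an assumption made here only for illustration; the
t-tests in a regression are generally correlated), the probability of at least one false
rejection would be $1 - (1 - \alpha)^m$. The Python sketch below evaluates this, together
with the common remedy of testing each coefficient at level $\alpha/m$ (the Bonferroni
adjustment, which is not developed in this section). The function names are hypothetical.

```python
def familywise_error(alpha, k):
    """Probability of at least one Type I error among k independent level-alpha tests."""
    return 1.0 - (1.0 - alpha) ** k


def bonferroni_level(alpha, k):
    """Per-test level that keeps the overall Type I error at or below alpha."""
    return alpha / k


overall_6 = familywise_error(0.05, 6)                  # six predictors, each tested at 5%
per_test = bonferroni_level(0.05, 6)
overall_adjusted = familywise_error(per_test, 6)       # overall error after adjustment
```

With six predictors the overall error is already above 26%, while the adjusted per-test
level of about 0.8% illustrates how small the individual significance levels must become,
and hence why truly nonzero coefficients may then be missed.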

In addition, in multiple regression, the multicollinearity problem presents us with 
additional difficulties since one of the effects of this is to produce large standard errors 
for the estimated coefficients. This imprecision in estimating individual coefficients may 
result in accepting all of the coefficients as zero, even if the overall fit is statistically 
significant. This will be illustrated numerically in Example 5.27. 

To overcome the problem of performing many individual tests it would be helpful to 
have a single statistic to test the hypothesis 


    $$ H_0: \beta_1 = \beta_2 = \cdots = \beta_m = 0, $$    (5.175)

against

    $$ H_1: \text{at least one } \beta_i \ne 0, \quad 1 \le i \le m, $$    (5.176)

and the ANOVA approach to testing developed for the simple linear regression model
suggests that the F-ratio

    $$ F = \frac{SSR/m}{s^2} $$    (5.177)



might be appropriate. In Section 5.8 we shall see that (5.177) is a particular case of a 
general methodology for testing hypotheses about the general linear model. For now, we 
will argue for (5.177) on an ad-hoc basis. 

We begin by showing that

    $$ E\left(\frac{SSR}{m}\right) = \sigma^2 + \frac{1}{m}(X_c\beta_c, X_c\beta_c) $$    (5.178)

where $X_c$ is the n x m matrix of centered regression variables introduced in Example
5.9 and $\beta_c = (\beta_1, \beta_2, \ldots, \beta_m)^T$. Since the proof of (5.178) is somewhat involved, the
reader may wish to skip the details and proceed to the material after Theorem 5.9 where
(5.178) is used to provide a justification for using F to test for the overall significance of
the regression.


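Theorem 5.8 Let $\bar{x}_j = \sum_{i=1}^n x_{ij}/n$ be the mean value of the j-th column of X, and
let $\bar{X}$ denote the matrix whose j-th column is $(\bar{x}_j, \bar{x}_j, \ldots, \bar{x}_j)^T$. Then in the general
linear model with uncorrelated homoscedastic errors, with variance $\sigma^2$,

    $$ E\left(\frac{SSR}{m}\right) = \sigma^2 + \frac{1}{m}((X - \bar{X})\beta, (X - \bar{X})\beta) = \sigma^2 + \frac{1}{m}(X_c\beta_c, X_c\beta_c). $$    (5.179)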

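Proof. We begin by noting that

    $$ \hat{y}_i = \hat\beta_0 + \sum_{j=1}^m x_{ij}\hat\beta_j, $$    (5.180)

and from $\partial SSE/\partial\beta_0 = 0$, that

    $$ \hat\beta_0 = \bar{y} - \sum_{j=1}^m \bar{x}_j\hat\beta_j. $$    (5.181)

Thus,

    $$ \hat{y}_i - \bar{y} = \sum_{j=1}^m (x_{ij} - \bar{x}_j)\hat\beta_j, \quad 1 \le i \le n. $$    (5.182)

Using the definition of $\bar{X}$ and noting that the 0-th column of $X - \bar{X}$ is 0, (5.182) can be
written in vector-matrix form as

    $$ \hat{Y} - \bar{Y} = (X - \bar{X})\hat\beta, $$    (5.183)

where $\bar{Y} = \bar{y}1$, so that

    $$ SSR = (\hat\beta, (X - \bar{X})^T (X - \bar{X})\hat\beta) = (\hat\beta, A\hat\beta) $$    (5.184)

where $A = (X - \bar{X})^T (X - \bar{X})$.
Now from (5.184) and Theorem 4.14

    $$ E(SSR) = \mathrm{tr}(A\Sigma) + (\beta, A\beta) $$    (5.185)

where $\Sigma$ is the variance-covariance matrix of $\hat\beta$. We now proceed to simplify A. First,
observe that

    $$ \bar{X} = \frac{EX}{n} $$    (5.186)

where

    $$ E = \begin{bmatrix} 1 & 1 & \cdots & 1 \\ \vdots & \vdots & & \vdots \\ 1 & 1 & \cdots & 1 \end{bmatrix} $$    (5.187)

is an n x n matrix consisting of all ones. Thus,

    $$ X - \bar{X} = \left(I_n - \frac{E}{n}\right)X $$    (5.188)

and $(X - \bar{X})^T = X^T (I_n - E/n)$ since E is symmetric. Using the fact that $E^2/n^2 = E/n$,
we get that

    $$ \left(I_n - \frac{E}{n}\right)^2 = I_n - \frac{2E}{n} + \frac{E^2}{n^2} = I_n - \frac{E}{n} $$    (5.189)

giving

    $$ A = X^T \left(I_n - \frac{E}{n}\right)X. $$    (5.190)

Also,

    $$ A\Sigma = \sigma^2 \left[X^T \left(I_n - \frac{E}{n}\right)X\right](X^T X)^{-1} $$    (5.191)

and using (4.32)

    $$ \mathrm{tr}(A\Sigma) = \mathrm{tr}(\Sigma A) = \sigma^2\, \mathrm{tr}\left[X(X^T X)^{-1}X^T \left(I_n - \frac{E}{n}\right)\right] $$
    $$ \qquad = \sigma^2\, \mathrm{tr}\left[H - \frac{HE}{n}\right] = \sigma^2 \left[\mathrm{tr}(H) - \mathrm{tr}\left(\frac{HE}{n}\right)\right]. $$    (5.192)

Since $HX = X$ and $(1,1,\ldots,1)^T = 1$ is the 0-th column of X, $H1 = 1$, so that
$HE = E$. Thus,

    $$ \mathrm{tr}(H) - \mathrm{tr}\left(\frac{HE}{n}\right) = \mathrm{tr}(H) - \mathrm{tr}\left(\frac{E}{n}\right) = m + 1 - 1 = m $$    (5.193)

so that $\mathrm{tr}(A\Sigma) = m\sigma^2$. Finally, using this in (5.185) gives

    $$ E\left(\frac{SSR}{m}\right) = \sigma^2 + \frac{1}{m}((X - \bar{X})\beta, (X - \bar{X})\beta). $$    (5.194)

Now, $(X - \bar{X})\beta = X_c\beta_c$, because the first column of $X - \bar{X}$ is zero. Thus,

    $$ ((X - \bar{X})\beta, (X - \bar{X})\beta) = (X_c\beta_c, X_c\beta_c) $$    (5.195)

so that

    $$ E\left(\frac{SSR}{m}\right) = \sigma^2 + \frac{1}{m}(X_c\beta_c, X_c\beta_c) $$    (5.196)

as required. ■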


From (5.196) we see that if $\beta_c = 0$ then $E(SSR/m) = E(s^2) = \sigma^2$. Hence, if $\beta_c \ne 0$,
then on average $SSR/m > \sigma^2$ since $E(SSR/m) > \sigma^2$. Thus, large values of the F-ratio
$(SSR/m)/s^2$ suggest that we should reject the hypothesis $H_0$ that $\beta_1 = \beta_2 = \cdots = \beta_m = 0$.


To determine the critical values for rejecting $H_0$ we need to determine the distribution
of F. In order to do this we will need to make our standard assumptions that the errors
are independent $N(0,\sigma^2)$ random variables.


Theorem 5.9 If the errors in (5.1) are independent $N(0,\sigma^2)$, and if $\beta_c = 0$, then
F has an F-distribution with (m, n - m - 1) degrees of freedom.


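Proof. Note from (5.184) and (5.190) that

    $$ SSR = \left(\hat{Y}, \left(I_n - \frac{E}{n}\right)\hat{Y}\right). $$    (5.197)

To simplify (5.197) we note that, with $X_1$ denoting the matrix of columns 1 to m of X,

    $$ \hat{Y} = HY = H(1\beta_0 + X_1\beta_c + \varepsilon) = \beta_0 H1 + HX_1\beta_c + H\varepsilon. $$    (5.198)

Since $\beta_c = 0$, $HX_1\beta_c = 0$ and $H1 = 1$, so that $\hat{Y} = HY = \beta_0 H1 + H\varepsilon = \beta_0 1 + H\varepsilon$.
Thus,

    $$ \left(\hat{Y}, \left(I_n - \frac{E}{n}\right)\hat{Y}\right) = \left(\beta_0 1 + H\varepsilon, \left(I_n - \frac{E}{n}\right)(\beta_0 1 + H\varepsilon)\right). $$    (5.199)

Since $(I_n - E/n)1 = 0$, (5.199) simplifies to

    $$ \left(\beta_0 1 + H\varepsilon, \left(I_n - \frac{E}{n}\right)H\varepsilon\right). $$    (5.200)

Using the fact that $I_n - E/n$ is symmetric and $(I_n - E/n)1 = 0$ in (5.200),

    $$ SSR = \left(H\varepsilon, \left(I_n - \frac{E}{n}\right)H\varepsilon\right) = \left(\varepsilon, H\left(I_n - \frac{E}{n}\right)H\varepsilon\right). $$    (5.201)

Due to the symmetry and idempotency of H (i.e., $H^2 = H$), and the fact that $HE =
EH = E$ (this follows from $H1 = 1$), (5.201) becomes

    $$ SSR = (\varepsilon, B\varepsilon) $$    (5.202)

where $B = H - E/n$.
Now

    $$ B^T = H^T - \frac{E^T}{n} = H - \frac{E}{n} $$    (5.203)

and

    $$ B^2 = \left(H - \frac{E}{n}\right)\left(H - \frac{E}{n}\right) = H^2 - \frac{HE}{n} - \frac{EH}{n} + \frac{E^2}{n^2} $$
    $$ \qquad = H - \frac{E}{n} - \frac{E}{n} + \frac{E}{n} = H - \frac{E}{n}. $$    (5.204)

Thus, B is symmetric and idempotent so that rank(H - E/n) = tr(H - E/n) = m + 1 -
1 = m. Hence, using the spectral theorem

    $$ B = H - \frac{E}{n} = Q^T \begin{bmatrix} I_m & 0 \\ 0 & 0 \end{bmatrix} Q $$    (5.205)

so that

    $$ SSR = \left(\varepsilon, Q^T \begin{bmatrix} I_m & 0 \\ 0 & 0 \end{bmatrix} Q\varepsilon\right) = \left(Z, \begin{bmatrix} I_m & 0 \\ 0 & 0 \end{bmatrix} Z\right) $$    (5.206)

where

    $$ Z = Q\varepsilon. $$    (5.207)

Thus,

    $$ SSR = \sum_{i=1}^m Z_i^2 $$    (5.208)

where the $Z_i$ are independent $N(0,\sigma^2)$. (This follows from (5.207) because $\varepsilon \sim N(0, \sigma^2 I_n)$ and Q is
orthogonal.) Thus, $Z_i/\sigma \sim N(0,1)$ and $SSR/\sigma^2 \sim \chi^2(m)$.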

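To complete the proof we must show that SSR and $s^2$ are independent random
variables. If this is the case, then

    $$ F = \frac{SSR/m}{s^2} = \frac{SSR/(m\sigma^2)}{s^2/\sigma^2} \sim \frac{\chi^2(m)/m}{\chi^2(n - m - 1)/(n - m - 1)} $$    (5.209)

is the ratio of two independent mean squares, hence it has an F-distribution.

To show the independence of SSR and $s^2$ it suffices to prove that SSR is independent
of the residuals $\hat\varepsilon_i$, $1 \le i \le n$. Now from (5.202), since $H - E/n = H(I_n - E/n)$ is
symmetric and idempotent,

    $$ SSR = \left(H\left(I_n - \frac{E}{n}\right)\varepsilon, H\left(I_n - \frac{E}{n}\right)\varepsilon\right). $$    (5.210)

Letting

    $$ W = H\left(I_n - \frac{E}{n}\right)\varepsilon = A\varepsilon $$    (5.211)

and

    $$ \hat\varepsilon = (I_n - H)\varepsilon = B\varepsilon, $$    (5.212)

where A and B now denote $H(I_n - E/n)$ and $I_n - H$ respectively,
it suffices to show that W and $\hat\varepsilon$ are independent random vectors. For this, letting

    $$ V = (W \mid \hat\varepsilon)^T = \begin{bmatrix} A \\ B \end{bmatrix}\varepsilon $$    (5.213)

and arguing as in Theorem 5.5 it now suffices to show that $AB^T = 0$. Now, since $I_n - H$
is symmetric

    $$ AB^T = H\left(I_n - \frac{E}{n}\right)(I_n - H) = H\left(I_n - H - \frac{E}{n} + \frac{EH}{n}\right). $$    (5.214)

Since $EH = E$ we get

    $$ AB^T = H(I_n - H) = 0. $$    (5.215)

■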

Using Theorem 5.9 we see that if $\varepsilon \sim N(0,\sigma^2 I_n)$, $H_0$ is rejected at level $\alpha$ if

    $$ F > F_{m,\,n-m-1,\,\alpha}. $$    (5.216)

This is the classical F-test for the overall significance of the regression. Although the
F-test given by (5.216) can be strictly used only if the errors are independent $N(0,\sigma^2)$
in the GLM, it is generally used even if this is not known to be true.

As for simple linear regression, this can frequently be justified, at least for large
sample sizes, by central limit theorem considerations. The F-test developed above leads
us to introduce the following ANOVA table as a way of summarizing the appropriate
statistics.


Table 5.8 ANOVA Table for the General Linear Model


Source        df             Sum of Squares    Mean Squares               F
Regression    m              SSR               MSR = SSR/m                F = MSR/MSE
Residual      n - (m + 1)    SSE               MSE = SSE/(n - m - 1)
Total         n - 1          SST


5.6.2 The Coefficient of Multiple Determination


As in the case of simple linear regression we can think of the F-test as testing for the
significance of the coefficient of multiple determination defined by

    $$ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} $$    (5.217)

because

    $$ F = \frac{SSR}{SSE}\left(\frac{n - m - 1}{m}\right) = \frac{SSR/SST}{SSE/SST}\left(\frac{n - m - 1}{m}\right) = \frac{R^2}{1 - R^2}\left(\frac{n - m - 1}{m}\right) $$    (5.218)

which is a monotone increasing function of $R^2$.
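
Relation (5.218) means that F can be computed either from the sums of squares or from
$R^2$ alone. The short Python sketch below checks that the two routes agree; the sums of
squares used (SSR = 80, SSE = 20 with n = 25 and m = 4) are made-up numbers chosen for
easy arithmetic, and the function names are hypothetical.

```python
def f_from_sums_of_squares(ssr, sse, n, m):
    """F-ratio (SSR/m) / (SSE/(n - m - 1)), as in (5.177)."""
    return (ssr / m) / (sse / (n - m - 1))


def f_from_r2(r2, n, m):
    """F-ratio written as a function of R^2, as in (5.218)."""
    return (r2 / (1.0 - r2)) * ((n - m - 1) / m)


ssr, sse, n, m = 80.0, 20.0, 25, 4           # illustrative values only
r2 = ssr / (ssr + sse)                       # R^2 = SSR/SST = 0.8

f_direct = f_from_sums_of_squares(ssr, sse, n, m)
f_via_r2 = f_from_r2(r2, n, m)
```

Both routes give F = 20 here, and increasing $R^2$ with n and m held fixed always
increases F.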

Since it follows from the decomposition of SST given in Eq. (5.141) that $0 \le R^2 \le 1$,
as for simple linear regression, values of $R^2$ close to one generally indicate good fits,
while “small” values suggest poor ones. Because of this, most practitioners of regression
analysis routinely examine $R^2$ as an important output statistic. The reader is cautioned,
however, that $R^2$ can be made arbitrarily close to one merely by adding variables to
the regression model (5.1), even if the variables are statistically insignificant. In fact,
if we have n variables (linearly independent) and n observations we get a perfect fit.
This problem of overfitting can generally be detected, because the F-ratio may then
be insignificant (say at $\alpha = 0.05$) or only marginally significant. To see why $R^2$ behaves
this way, consider SSE/SST in (5.218). Now SST is fixed by the data, but SSE can only
decrease or stay the same when variables are added, since we then minimize
$(y - X\beta, y - X\beta)$ over a larger set of $\beta$'s. Thus SSE/SST is a nonincreasing function
of the number of variables, so $R^2$ is nondecreasing. Because of this possibility of
artificially inflating $R^2$ one can modify $R^2$ in (5.217) by replacing SSE and SST by their
mean squares SSE/(n - m - 1) and SST/(n - 1) respectively, giving the adjusted $R^2$,

    $$ \bar{R}^2 = 1 - \frac{SSE/(n - m - 1)}{SST/(n - 1)}. $$    (5.219)

Note that because of the factor n - m - 1 the addition of a variable does not
necessarily increase $\bar{R}^2$, since n - m - 1 decreases as m increases, which offsets the decrease
in SSE.

As for simple linear regression we complete the ANOVA table, Table 5.8, by adding a
final row listing the values of $R^2$ and $\bar{R}^2$.


Example 5.22 (ANOVA tables for Examples 5.12-5.17) Here we present the ANOVA 
tables for Examples 5.12-5.17 which will allow us to make further inferences concerning 
the models proposed there. 


(1) Housing data: For the housing price data in Example 5.12 we find that the ANOVA 
table is: 


Table 5.9 ANOVA for Housing Data

Source        df    Sum of Squares      Mean Squares      F
Regression     2    1.67531 x 10^10     8.3766 x 10^9     138.99
Residual      12    7.23185 x 10^8      6.0265 x 10^7
Total         14    1.74763 x 10^10

$R^2 = 0.959$    $\bar{R}^2 = 0.952$


From Table 5.9, F = 138.99 which again is significant at the < 0.1% level so we accept
the hypothesis that at least one of the variables, square footage or age, is significant
in explaining the observed variation in housing prices.


(2) Birth weight data: The ANOVA table for the model given by (5.103) is:

Table 5.10 ANOVA for Birth Weight Data

Source        df    Sum of Squares    Mean Squares    F
Regression     3    1,217,562         405,854         13.26
Residual      20    612,311           30,616
Total         23    1,829,873

$R^2 = 0.66$    $\bar{R}^2 = 0.615$


Again F is significant at the < 0.1% level which leads us to accept the hypothesis
that at least one of $x_1$, $x_2$ or $x_3$ is significant in explaining the variation in birth
weights. From our discussion in Chapter 3 we certainly expect $x_1$ (gestation period)
to be significant but as yet we cannot infer anything about the effect of sex. From
Table 5.6 we see that the VIFs for $\hat\beta_2$ and $\hat\beta_3$ are quite large, suggesting a strong
multicollinearity between these factors. As will be shown, this problem confounds
the interpretation of the significance of $x_2$ and $x_3$.


(3) Hald data: The ANOVA table for the Hald data is:

Table 5.11 ANOVA for Hald Data

Source        df    Sum of Squares    Mean Squares    F
Regression     4    2,667.90          666.97          111.48
Residual       8    47.86             5.98
Total         12    2,715.76

$R^2 = 0.982$    $\bar{R}^2 = 0.974$


As in our previous examples, F is significant at the < 0.1% level and so we accept the
hypothesis that at least one of $x_1$-$x_4$ is significant in explaining the variation in
heat evolved. The large values of $R^2$ and $\bar{R}^2$ indicate that essentially all of the
variation in y can be explained by the model (5.104). However, the large VIFs for
$\hat\beta_2$ and $\hat\beta_4$ suggest a possible problem in interpreting the estimated coefficients $\hat\beta_2$
and $\hat\beta_4$.


(4) Longley data:

Table 5.12 ANOVA for Longley Data

Source        df    Sum of Squares    Mean Squares    F
Regression     6    184,172,402       30,695,400      330.29
Residual       9    836,424           92,936
Total         15    185,008,826

$R^2 = 0.995$    $\bar{R}^2 = 0.992$


From Table 5.12 we see that F = 330.29 which is significant at the < 0.1% level,
indicating that at least one of the variables $x_1$-$x_6$ is significant in explaining the
observed data. Also, the large values of $R^2$ and $\bar{R}^2$ indicate that there is almost a
perfect fit using the predictors $x_1$-$x_6$. The possibility that fewer variables may be
sufficient will be discussed in the next section.


As for simple linear regression $R^2$ has a correlation interpretation. It is the square of
the sample correlation between y and $\hat{y}$.
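
The summary statistics in Tables 5.9-5.12 follow directly from SSR, SSE, n and m. As a
check, the Python sketch below recomputes F, $R^2$ and $\bar{R}^2$ from the sums of squares
reported in those tables, taking n as the total degrees of freedom plus one. The function
name `anova_summary` and the dictionary layout are assumptions made for this illustration.

```python
def anova_summary(ssr, sse, n, m):
    """Return (F, R^2, adjusted R^2) from the regression and residual sums of
    squares, for n observations and m predictors plus an intercept."""
    sst = ssr + sse
    msr = ssr / m
    mse = sse / (n - m - 1)
    f_ratio = msr / mse                         # (5.177)
    r2 = 1.0 - sse / sst                        # (5.217)
    adj_r2 = 1.0 - mse / (sst / (n - 1))        # (5.219)
    return f_ratio, r2, adj_r2


# (SSR, SSE, n, m) as reported in Tables 5.9-5.12
tables = {
    "housing": (1.67531e10, 7.23185e8, 15, 2),
    "birth weight": (1217562.0, 612311.0, 24, 3),
    "Hald": (2667.90, 47.86, 13, 4),
    "Longley": (184172402.0, 836424.0, 16, 6),
}

summary = {name: anova_summary(*args) for name, args in tables.items()}
```

Small differences in the last digit of F (for example for the Hald data) arise because
the tables print rounded sums of squares.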


5.7 Confidence Intervals and t-Tests for the Coefficients


5.7.1 Confidence Intervals 


If the overall regression is found to be significant, attention then generally turns to a 
consideration of which of the individual coefficients is contributing to the fit. This analysis 
may be carried out using tests developed from confidence intervals for the regression 
coefficients as was done for simple linear regression in Chapter 3. The confidence intervals 
we develop are strictly valid only if the standard normality conditions are met. However, 
under appropriate conditions on the errors, they may be valid asymptotically for large 
sample sizes n, and they are routinely used in practice for screening the contribution of 
individual variables even when little is known about the errors. 

To derive these intervals let $\delta_j$ denote the j-th diagonal element of $(X^T X)^{-1}$. Then,
as we showed in Theorem 5.5,

    $$ T_j = \frac{\hat\beta_j - \beta_j}{s\sqrt{\delta_j}} $$    (5.220)

has a t-distribution with n - m - 1 degrees of freedom. Then using standard manipulations
as in Chapters 2 and 3 we find that a $(1 - \alpha) \times 100\%$ confidence interval for $\beta_j$ is given
by

    $$ \left(\hat\beta_j - t_{n-m-1,\alpha/2}\, s\sqrt{\delta_j},\ \ \hat\beta_j + t_{n-m-1,\alpha/2}\, s\sqrt{\delta_j}\right). $$    (5.221)


If m = 1, these intervals are easily shown to be those derived in Chapter 3 for the 
coefficients of the simple linear regression model. 
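
As an illustration of (5.221), the Python sketch below computes a 95% interval from an
estimate and its standard error $s\sqrt{\delta_j}$. The numbers are those reported for $x_3$
in the Longley data (Table 5.14 of Example 5.24), where n - m - 1 = 9; the critical value
$t_{9,0.025} = 2.262$ is taken from standard t tables and passed in by hand, and the
function name is hypothetical.

```python
def coefficient_interval(beta_hat, std_err, t_crit):
    """Two-sided confidence interval (5.221): beta_hat +/- t_crit * std_err,
    where std_err is s * sqrt(delta_j)."""
    half_width = t_crit * std_err
    return beta_hat - half_width, beta_hat + half_width


beta3_hat = -2.0202          # coefficient of x_3, Table 5.14
se_beta3 = 0.4884            # its standard error, Table 5.14
t_9_025 = 2.262              # tabulated t critical value, 9 degrees of freedom

lower, upper = coefficient_interval(beta3_hat, se_beta3, t_9_025)
excludes_zero = not (lower <= 0.0 <= upper)
```

The interval is roughly (-3.12, -0.92); since it excludes zero, the corresponding
two-sided test at the 5% level rejects $\beta_3 = 0$.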




5.7.2  t-Tests 


Using the confidence intervals given in (5.221) we can develop a test of

    $$ H_0: \beta_j = b_j $$    (5.222)

against

    $$ H_1: \beta_j \ne b_j. $$    (5.223)

The critical region for a level $\alpha$ test is found by rejecting $H_0$ if $b_j$ is not in the confidence
interval (5.221). Simple calculations show that this region is given by

    $$ \frac{\hat\beta_j - b_j}{s\sqrt{\delta_j}} < -t_{n-m-1,\alpha/2} \quad \text{or} \quad \frac{\hat\beta_j - b_j}{s\sqrt{\delta_j}} > t_{n-m-1,\alpha/2} $$    (5.224)

or equivalently

    $$ |T_j| > t_{n-m-1,\alpha/2} $$    (5.225)

where $T_j = (\hat\beta_j - b_j)/(s\sqrt{\delta_j})$.


When $b_j = 0$, we have a test for the significance of an individual regression coefficient.
That is, we reject $H_0: \beta_j = 0$ at level $\alpha$ if

    $$ \left|\frac{\hat\beta_j}{s\sqrt{\delta_j}}\right| > t_{n-m-1,\alpha/2}. $$    (5.226)

If n - m - 1 > 20, one can use an “eyeball” test for testing the significance of $\hat\beta_j$. Because,
for large n, $T_j$ is approximately N(0,1), a 5% level test is to reject $H_0$ if

    $$ |T_j| > t_{n-m-1,0.025} \approx z_{0.025} = 1.96 \approx 2. $$    (5.227)


This allows one to quickly screen computer output for the significance of regression
coefficients. If no significant model violations or multicollinearity are present, then one
might consider removing any variable for which $|T_j| < 2$. When serious
multicollinearity is present, however, the regression may be significant as measured by the
F-test, but all or almost all of the variables may appear to be insignificant as measured
by t-tests. Conversely, as shown by Freedman in [37], when a model has a large number
of possible variables, then many variables may appear to be significant, even if they are
random noise! When using t-tests proceed with caution!
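
The screening rule is simple to apply to a table of coefficients and standard errors. The
Python sketch below uses the values reported for the Longley data in Table 5.14 of
Example 5.24. Note that there n - m - 1 = 9, so the cutoff of 2 is only a rough guide
(the exact 5% critical value for 9 degrees of freedom is 2.262); the dictionary layout
and function name are assumptions made for illustration.

```python
def t_statistic(coefficient, std_err, b=0.0):
    """T_j = (beta_hat_j - b_j) / (s * sqrt(delta_j)); b = 0 tests significance."""
    return (coefficient - b) / std_err


# (coefficient, standard error) pairs as reported in Table 5.14
longley = {
    "constant": (-3482259.0, 890420.0),
    "x1": (15.06, 84.91),
    "x2": (-0.03582, 0.03349),
    "x3": (-2.0202, 0.4884),
    "x4": (-1.0332, 0.2143),
    "x5": (-0.0511, 0.2261),
    "x6": (1829.2, 455.5),
}

t_values = {name: t_statistic(coef, se) for name, (coef, se) in longley.items()}

# "Eyeball" screen: flag terms with |T_j| > 2
flagged = [name for name, t in t_values.items() if abs(t) > 2.0]
```

The flagged terms are the constant, $x_3$, $x_4$ and $x_6$, in agreement with the starred
p-values in Table 5.14; as the surrounding discussion warns, the unflagged variables
should not be dropped on this evidence alone when multicollinearity is present.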

By using one-sided confidence intervals, one can develop tests of $H_0: \beta_j = b_j$ against
the one-sided alternatives

    $$ H_1: \beta_j > b_j \quad \text{or} \quad H_1: \beta_j < b_j. $$    (5.228)

In the first case the critical region for a level $\alpha$ test is to reject $H_0$ if

    $$ T_j > t_{n-m-1,\alpha} $$    (5.229)

while in the second it is to reject $H_0$ if

    $$ T_j < -t_{n-m-1,\alpha}. $$    (5.230)

Several examples illustrating these ideas are given next.


Example 5.23 (Format for listing t-values) Before presenting numerical results for 
Examples 5.24-5.27 we give a standard format for presenting t values for each estimated 
regression coefficient. This format, or slight modifications of it, is standard output from 
most widely available statistical packages. 


Table 5.13 t-values for Estimated Regression Coefficients


Predictor    Coefficient      S.E.               t-statistic                              p-value
constant     $\hat\beta_0$    $s(\hat\beta_0)$   $t_0 = \hat\beta_0/s(\hat\beta_0)$       $P\{|T_0| > |t_0|\}$
$x_1$        $\hat\beta_1$    $s(\hat\beta_1)$   $t_1 = \hat\beta_1/s(\hat\beta_1)$       $P\{|T_1| > |t_1|\}$
$x_2$        $\hat\beta_2$    $s(\hat\beta_2)$   $t_2 = \hat\beta_2/s(\hat\beta_2)$       $P\{|T_2| > |t_2|\}$
...
$x_m$        $\hat\beta_m$    $s(\hat\beta_m)$   $t_m = \hat\beta_m/s(\hat\beta_m)$       $P\{|T_m| > |t_m|\}$


The first column lists the variables in the model. The second column gives the least
squares estimates $\hat\beta_i$ of $\beta_i$, $0 \le i \le m$, while the third column gives their standard
errors. The t values are given in the fourth column and their significance levels (to
three significant figures) are given in the last. Unless otherwise stated, we will assume a
significance level of 5%, so any value p < 0.05 indicates a coefficient significantly different
from zero. Sometimes an additional column listing VIF is also given.


Example 5.24 (t values for Longley data) Using the format in Example 5.23 the t
values and their significance levels are given below.


Table 5.14 t values for Longley data

Predictor    Coefficient    S.E. Coeff.    t-statistic    p-value    VIF
constant     -3,482,259     890,420        -3.91          *0.004
x_1          15.06          84.91           0.18           0.863     135.5
x_2          -0.03582       0.03349        -1.07           0.313     1788.5
x_3          -2.0202        0.4884         -4.14          *0.003     33.6
x_4          -1.0332        0.2143         -4.82          *0.001     3.6
x_5          -0.0511        0.2261         -0.23           0.826     399.2
x_6          1829.2         455.5           4.02          *0.003     759.0


From these results we see that $\hat\beta_0$, $\hat\beta_3$, $\hat\beta_4$ and $\hat\beta_6$ appear to be significantly different
from zero at the 5% level (indicated by *), while $\hat\beta_1$, $\hat\beta_2$ and $\hat\beta_5$ appear to be zero. However,
the apparent multicollinearity in the data and our comments concerning multiple
t-tests suggest caution in eliminating $x_1$, $x_2$ and $x_5$ from the model, since doing this
can bias the remaining coefficients. Notwithstanding those caveats we refit the data by
regressing y on the variables $x_3$, $x_4$ and $x_6$. The ANOVA table and t statistics are given
below.
Table 5.15 ANOVA table for reduced Longley data

Source        df    Sum of Squares    Mean Squares    F
Regression     3    183,685,465       61,228,488      555.21
Residual      12    1,323,361         110,280
Total         15    185,008,826

$R^2 = 0.993$    $\bar{R}^2 = 0.991$


Table 5.16 t statistics for reduced Longley data

Predictor    Coefficient    S.E. Coeff.    t-statistic    p-value    VIF
constant     -1,797,221     68,642         -26.18         0.000
x_3          -1.4697        0.1671          -8.79         0.000      3.3
x_4          -0.7723        0.1837          -4.20         0.001      2.2
x_6          956.38         35.52           26.92         0.000      3.9


From Table 5.15 we see that the ANOVA tables are quite similar with virtually the
same values of $R^2$ and $\bar{R}^2$. Hence, the reduced model appears to explain the observed
data as well as the full model. In addition, the standard errors of the coefficients are
substantially smaller than in the full model, giving highly significant t values. Moreover,
there has been a substantial reduction in the VIFs indicating that the multicollinearity
problem has been substantially alleviated. It appears that in eliminating the variables
$x_1$, $x_2$ and $x_5$ from the full model we have obtained a model which seems to be much
more reliable than the original one.

It is interesting to ask if this is the “best” model. This is a topic we will take up
in more detail in Chapter 8. For comparison purposes we used a standard stepwise
regression variable selection procedure on the full Longley data. As we discuss later,
such methods purport to select the best set of predictors from a given proposed set.
Using this procedure only the variables $x_2$ and $x_3$ were selected, putting in a variable
deemed insignificant in our original analysis and omitting two variables that were highly
significant in the second. Further details are given in Tables 5.17-5.18.


Table 5.17 ANOVA Table for “best” Longley model

Source        df    Sum of Squares    Mean Squares    F
Regression     2    181,429,761       90,714,881      329.50
Residual      13    3,579,065         275,313
Total         15    185,008,826

$R^2 = 0.981$    $\bar{R}^2 = 0.978$


Table 5.18 t statistics for “best” Longley model

Predictor    Coefficient    S.E. Coeff.    t-statistic    p-value    VIF
constant     52,382.20      573.5          91.33          0.000
x_2          0.037840       0.001711       22.12          0.000      1.6
x_3          -0.5436        0.1820         -2.99          0.010      1.6


Again we observe that the ANOVA table is similar to Tables 5.12 and 5.15 and the
overall fit appears to be about the same as the full and reduced models. But there is
a substantial change in the intercept and a change in sign of the coefficient of $x_2$ from
negative in the full model to positive in the “best” model.
Since $x_2$ is the GNP, intuitively we expect total employment to increase as GNP
increases and the “best” model shows this while the full model gives a result which is
counterintuitive. Again, this is another consequence of multicollinearity. Clearly, further
tools are needed to assess the appropriateness of the proposed models and these will be
developed as we proceed.


Example 5.25 (t statistics for drink delivery data) Continuing our analysis of the
drink delivery data we give the t statistics in Table 5.19.


Table 5.19 t statistics for drink delivery data

Predictor    Coefficient    S.E. Coeff.    t-statistic    p-value    VIF
constant     2.341          1.097          2.13           0.044
x_1          1.6159         0.1707         9.46           0.000      3.1
x_2          0.014385       0.003613       3.98           0.001      3.1


From Table 5.19 we see that both $\hat\beta_1$ and $\hat\beta_2$ are significantly different from zero so
that including “distance walked” appears to be useful in predicting the time it takes to
deliver drinks in addition to the number of cases. Without further examination, we can
conclude that the model

    $$ Y = 2.341 + 1.6159x_1 + 0.014385x_2 + \varepsilon $$

appears to be quite acceptable for explaining the variation in delivery times.


Example 5.26 (t statistics for birth weight data) In Example 5.15 we found that
the overall model of two lines was highly significant in explaining the observed variation
in birth weights. From our analysis in Chapter 3 we expect gestation period to be a
significant predictor but is “sex” as well? To evaluate this we give the t-statistics in
Table 5.20.


Table 5.20 t statistics for birth weight data

Predictor    Coefficient    S.E. Coeff.    t-statistic    p-value    VIF
constant     -2,142         1,189          -1.80          0.087
x_1          130.40         30.66           4.25          0.000      2.1
x_2          1,042          1,647           0.63          0.534      477.6
x_3          -22.73         42.67          -0.53          0.600      473.3


From Table 5.20 we see that the results generally confirm the model using “gestation
period” as the only predictor. $\hat\beta_1$ is significantly different from zero, while $\hat\beta_0$ is only
marginally so. However, the effect of sex is not clear, since the large VIFs for $\hat\beta_2$ and $\hat\beta_3$
indicate a strong collinearity between these two variables.

Of course, simply eliminating the variables $x_2$ and $x_3$ does not seem appropriate,
since our previous analysis suggests that the sex of a baby influences its birth weight.
Since (5.101) represents two lines, eliminating $x_2$ would yield a model given by one line
so it is not logical to include an interaction term which provides for a difference in slope
in identical lines. Hence, if we eliminate one of the variables it should be $x_3$ first. Doing
this and refitting the data gave the following ANOVA table and t statistics.


Table 5.21 ANOVA table for reduced birth weight data

Source       df   Sum of Squares   Mean Squares   F
Regression   2    1,209,033        604,517        20.45
Residual     21   620,840          29,564
Total        23   1,829,873

$R^2 = 0.661$, $\bar{R}^2 = 0.628$


Table 5.22 t statistics for reduced birth weight data

Predictor   Coefficient   S.E. Coeff.   t-statistic   p-value   VIF
constant    -1,702.30     747.00        -2.28         0.000
$x_1$       119.06        19.23         6.19          0.000     1.0
$x_2$       182.12        71.09         2.57          0.011     1.0


From Tables 5.21 and 5.22 we see that the reduced model is significant in explaining the observed
birth weights since $P\{F > 20.45\} < 10^{-3}$. In contrast to the full model, the coefficients
$\beta_1$ and $\beta_2$ are significantly different from zero. Also both VIFs equal one so the effect of
multicollinearity seems to have been eliminated and the signs of both coefficients agree
with our intuition. As a consequence, the analysis so far indicates that an appropriate
model for the birth weight data is two parallel straight lines represented by

$$Y = -1702.3 + 119.06x_1 + 182.12x_2 + \varepsilon.$$
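
The summary statistics quoted with Table 5.21 can be recovered from the sums of squares alone. The sketch below assumes the usual definitions $R^2 = SSR/SST$ and $\bar{R}^2 = 1 - MSE/(SST/(n-1))$; the function name `anova_summary` is hypothetical and not part of the text.

```python
def anova_summary(ss_regression, ss_residual, df_regression, df_residual):
    """Return F, R^2 and adjusted R^2 from the entries of an ANOVA table."""
    ss_total = ss_regression + ss_residual
    df_total = df_regression + df_residual
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    f_stat = ms_regression / ms_residual
    r2 = ss_regression / ss_total
    # Adjusted R^2 compares the residual mean square with the total mean square.
    adj_r2 = 1.0 - ms_residual / (ss_total / df_total)
    return f_stat, r2, adj_r2


# Entries of Table 5.21 (reduced birth weight model).
f_stat, r2, adj_r2 = anova_summary(1_209_033, 620_840, 2, 21)
print(f"F = {f_stat:.2f}, R^2 = {r2:.3f}, adjusted R^2 = {adj_r2:.3f}")
```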


Example 5.27 (Hald data) As discussed in Example 5.16 the Hald data appeared
to fit almost perfectly using the predictors $x_1$-$x_4$. However, the large VIFs suggested
possible imprecision in estimating the coefficients.


Table 5.23 t statistics for Hald data

Predictor   Coefficient   S.E. Coeff.   t-statistic   p-value   VIF
constant    62.41         70.07         0.89          0.399
$x_1$       1.5511        0.7448        2.08          0.071     38.5
$x_2$       0.5102        0.7238        0.70          0.501     254.4
$x_3$       0.1019        0.7547        0.14          0.896     46.9
$x_4$       -0.1441       0.7091        -0.20         0.844     282.5


From Table 5.23 we see that even though the overall fit is highly significant, none of
the variables appears to be significant at the 5% level. The only variable which appears
marginally significant is $x_1$. Since our analysis suggests at least one of $x_1$-$x_4$ should
appear in the model, it appears that $x_1$ is a good choice. In fact, regressing $y$ on $x_1$ gives
the following ANOVA and t statistics.


Table 5.24 ANOVA table for $x_1$ in Hald data

Source       df   Sum of Squares   Mean Squares   F
Regression   1    1450.1           1450.1         12.60
Residual     11   1265.7           115.1
Total        12   2715.8

$R^2 = 0.534$, $\bar{R}^2 = 0.492$


Table 5.25 t statistics for $x_1$ in Hald data

Predictor   Coefficient   S.E. Coeff.   t-statistic   p-value
constant    81.4790       4.9270        16.54         0.000
$x_1$       1.8687        0.5264        3.55          0.005


From Tables 5.24 and 5.25 we see that the regression is significant at $\alpha = 5\%$ and of
course $\beta_1$ is seen to be significantly different from zero. However, $R^2$ is only about
one-half that of the full model, indicating that at least one other variable is important. From
Table 5.23 it is reasonable to consider $x_2$ since it has the second largest t value. As a
consequence, we refit the data using $x_1$ and $x_2$ and the results are given in Tables 5.26
and 5.27.


Table 5.26 ANOVA table for Hald data using $x_1$, $x_2$

Source       df   Sum of Squares   Mean Squares   F
Regression   2    2,657.9          1,328.9        229.5
Residual     10   57.9             5.8
Total        12   2,715.8

$R^2 = 0.979$, $\bar{R}^2 = 0.974$


Table 5.27 t statistics for Hald data using $x_1$, $x_2$

Predictor   Coefficient   S.E. Coeff.   t-statistic   p-value   VIF
constant    52.577        2.286         23.00         0.000
$x_1$       1.4683        0.1213        12.10         0.000     1.1
$x_2$       0.66225       0.04585       14.44         0.000     1.1


From Table 5.26 we observe that the model containing $x_1$ and $x_2$ is highly significant
in explaining the data since $P\{F > 229.5\}$ is negligibly small and the $R^2$ and $\bar{R}^2$ values are almost
the same as in the full model. From Table 5.27 we see that both $\beta_1$ and $\beta_2$ are significantly
different from zero. In addition, the VIFs indicate that the multicollinearity has been
alleviated. At present, it appears that the fitted model

$$Y = 52.577 + 1.4683x_1 + 0.66225x_2 + \varepsilon$$

provides a good explanatory model for the Hald data.


From Examples 5.24-5.27 we see that the statistics $F$, $R^2$ and $t$ are useful in helping to
disentangle the effect of various predictors on a given set of observed data, but they cannot
be used in a completely automatic fashion. Despite the fact that modern computers have
enabled us to do regressions “at will”, human intervention is still needed to understand
possibly contradictory results.

As for simple linear regression, additional tools are necessary to further understand
complicated relationships. Those will be developed as we proceed.


5.8 The Extra Sum of Squares Principle 
5.8.1 The General Linear Hypothesis 


So far we have considered two types of tests for the general linear model: the F-test
for the overall significance of the regression and t tests for the significance of individual
regression coefficients. In this section we will show that these tests may be considered as
particular cases of a test for a general linear hypothesis of the form

$$H_0 : C\beta = b \qquad (5.231)$$

against

$$H_1 : C\beta \neq b \qquad (5.232)$$

where $C$ is an $r \times (m+1)$ matrix of rank $r$. (Thus, $r \le m+1$.)
The equation $C\beta = b$ represents $r$ linearly independent constraints

$$\sum_{j=0}^{m} c_{ij}\beta_j = b_i, \quad 1 \le i \le r \qquad (5.233)$$

on the coefficients of the GLM and the purpose of the test we develop is to determine
if the data can justify the imposition of these constraints. Rejecting $H_0$ indicates that
the data cannot support the possibility that all $r$ conditions given in (5.233) hold and
we would conclude that at least one of them fails.

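Before indicating the nature of the test we choose, let us show that by appropriately
choosing $C$ and $b$ we arrive at the hypotheses for the overall significance of the regression
and those for testing the significance of the individual coefficients.

In the first case, letting $C$ denote the $m \times (m+1)$ matrix

$$C = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix} \qquad (5.234)$$

and $b = (0, 0, \ldots, 0)^T$, then the conditions $\beta_1 = \beta_2 = \cdots = \beta_m = 0$ are equivalent to the
vector equation

$$C\beta = 0. \qquad (5.235)$$

In the second, the hypothesis that $\beta_i = 0$ can be written as

$$C\beta = b \qquad (5.236)$$

with $C = [0, \ldots, 0, 1, 0, \ldots, 0]$ (the 1 in the position corresponding to $\beta_i$) and $b = 0$.

As an additional example, suppose that in the model

$$Y_x = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_3 + \beta_4x_4 + \varepsilon_x \qquad (5.237)$$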
we wanted to simultaneously test that 8, = 0 and B4 = 84, then this could be done by 
letting 
010 0 
с= | Ж A (5.238) 


and $b = (0, 0)^T$ as the reader may easily verify.

As a less arbitrary example, suppose that in the salary model proposed in Example
5.5 we wanted to test the hypothesis that both male and female salary structures were
the same. Then this would require testing

$$H_0 : \beta_2 = \beta_3 = 0 \qquad (5.239)$$

against

$$H_1 : \beta_2 \neq 0 \text{ or } \beta_3 \neq 0. \qquad (5.240)$$

If

$$C = \begin{pmatrix} 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix} \qquad (5.241)$$

and $b = (0, 0)^T$, then these hypotheses can be written as

$$H_0 : C\beta = 0 \qquad (5.242)$$

and

$$H_1 : C\beta \neq 0. \qquad (5.243)$$
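
To make the role of $C$ concrete, the sketch below builds the constraint matrices for the overall-significance test, for a single coefficient, and for the salary example (5.241), and evaluates $C\beta$ for a made-up coefficient vector. The helper names and the numerical values of `beta` are assumptions chosen only for illustration.

```python
def constraint_values(C, beta):
    """Return the vector C * beta for a matrix C given as a list of rows."""
    return [sum(c * b for c, b in zip(row, beta)) for row in C]


def overall_significance_matrix(m):
    """m x (m+1) matrix [0 | I_m] that selects beta_1, ..., beta_m."""
    return [[1 if j == i + 1 else 0 for j in range(m + 1)] for i in range(m)]


def single_coefficient_matrix(m, i):
    """1 x (m+1) matrix that selects beta_i, 0 <= i <= m."""
    return [[1 if j == i else 0 for j in range(m + 1)]]


beta = [2.0, 0.5, -1.0, 3.0]          # hypothetical (beta_0, ..., beta_3)
C_overall = overall_significance_matrix(3)
C_single = single_coefficient_matrix(3, 2)
C_salary = [[0, 0, 1, 0], [0, 0, 0, 1]]   # the matrix in (5.241)

print(constraint_values(C_overall, beta))  # values of beta_1, beta_2, beta_3
print(constraint_values(C_single, beta))   # value of beta_2
print(constraint_values(C_salary, beta))   # values of beta_2, beta_3
```

In each case the hypothesis $C\beta = b$ with $b = 0$ states that the selected coefficients all vanish.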
5.8.2 The F-Test

To develop the test of

$$H_0 : C\beta = b \qquad (5.244)$$

against

$$H_1 : C\beta \neq b, \qquad (5.245)$$

we begin by fitting the unconstrained model, in which the constraint $C\beta = b$ is not imposed. This is usually called
the full model. We then fit the model under the condition that $H_0$ is true, i.e., that
$C\beta = b$. This is usually called the reduced model.

Now let $SSE_F$ denote the residual sum of squares from the full model and $SSE_R$ that
from the reduced model and note that it will always be the case that $SSE_F \le SSE_R$. However,
if $H_0$ is false, then we would expect the reduction in the residual sum of squares

$$\Delta SSE = SSE_R - SSE_F \qquad (5.246)$$

to be significantly greater than the random error $\sigma^2$. Hence, we would expect that a
reasonable test would compare $\Delta SSE$ to $s^2$ and reject $H_0$ if $\Delta SSE/s^2$ was sufficiently
large. This test will be based on a generalization of the F-test for the overall significance
of the regression based on the $F$ statistic

$$F = \frac{\Delta SSE/r}{s^2}. \qquad (5.247)$$

When the errors in the GLM are independent $N(0, \sigma^2)$ random variables and $H_0$ is true
then $F$ has an F-distribution with $(r, n-m-1)$ degrees of freedom. Thus, the critical
region for rejecting $H_0$ at level $\alpha$ is

$$F > f_{r,\,n-m-1,\,\alpha}. \qquad (5.248)$$

Before proving this (which we leave as optional reading) we will show that this F-test
contains, as a particular case, the F-test for the overall significance of the regression and
is equivalent to the t-test for testing the significance of an individual regression
coefficient.


Example 5.28 When the standard normality assumptions are met for the GLM,
then the F-test given by (5.248) is the same as that given in Section 5.6 for the overall
significance of the regression.

To prove this we note that here the reduced model is given by

$$Y_i = \beta_0 + \varepsilon_i, \quad 1 \le i \le n, \qquad (5.249)$$

so that $SSE_R = \sum_{i=1}^{n}(y_i - \bar{y})^2 = SST$. Thus,

$$\Delta SSE = SST - SSE_F = \sum_{i=1}^{n}(y_i - \bar{y})^2 - \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 = SSR. \qquad (5.250)$$

Hence, the reduction of the residual sum of squares is just the regression sum of squares
and $F$ in (5.247) is $(SSR/m)/s^2$, which is the $F$ statistic given previously and so has an
F distribution with $(m, n-m-1)$ degrees of freedom.

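To get the result concerning t tests for individual coefficients we will need to use a
result which will be established in Theorem 5.10. It will be shown there that

$$\Delta SSE = (C\hat\beta_F - b,\; A(C\hat\beta_F - b)) \qquad (5.251)$$

where

$$A = [C(X^TX)^{-1}C^T]^{-1}. \qquad (5.252)$$

When testing for the significance of the $i$-th coefficient $\beta_i$, $C = (0, \ldots, 0, 1, 0, \ldots, 0)$,
with the 1 in the $i$-th position, and $b = 0$ so that (show this)

$$C(X^TX)^{-1}C^T = \delta_i, \qquad (5.253)$$

where $\delta_i$ is the $i$-th diagonal element of $(X^TX)^{-1}$. Thus, for testing $\beta_i = 0$,

$$\Delta SSE = \frac{\hat\beta_i^2}{\delta_i} \qquad (5.254)$$

and

$$\frac{\Delta SSE}{s^2} = \frac{\hat\beta_i^2}{s^2\delta_i} = T_i^2 \qquad (5.255)$$

where $T_i = \hat\beta_i/(s\sqrt{\delta_i})$ is the $T$ statistic for testing $H_0 : \beta_i = 0$. Thus, the F test in (5.255)
is equivalent to rejecting $H_0$ if

$$T_i^2 > f_{1,\,n-m-1,\,\alpha} \qquad (5.256)$$

and this is equivalent to rejecting $H_0$ if

$$|T_i| > \sqrt{f_{1,\,n-m-1,\,\alpha}} = t_{n-m-1,\,\alpha/2}. \qquad (5.257)$$

For this reason, the t-tests are often referred to as partial F-tests [27].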
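
The equivalence between the partial F-test and the t-test can be checked numerically with the Hald results reported earlier. The sketch below takes the model with $x_1$ and $x_2$ (Table 5.26) as the full model and the model with $x_1$ only (Table 5.24) as the reduced model, so that $r = 1$; the function name `extra_ss_f` is hypothetical, and agreement is only up to the rounding in the tables.

```python
def extra_ss_f(sse_reduced, sse_full, r, df_full):
    """F statistic (5.247): (delta SSE / r) / s^2 with s^2 = SSE_F / df_full."""
    delta_sse = sse_reduced - sse_full
    s2 = sse_full / df_full
    return (delta_sse / r) / s2


# Residual sums of squares from Tables 5.24 (x1 only) and 5.26 (x1 and x2).
f_stat = extra_ss_f(sse_reduced=1265.7, sse_full=57.9, r=1, df_full=10)
t_x2 = 14.44   # t statistic of x2 in Table 5.27

print(f"partial F = {f_stat:.1f}, t^2 = {t_x2 ** 2:.1f}")
```

Both numbers are close to 208.5, as (5.255) predicts.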
5.8.3 Derivation of the F-Test


Theorem 5.10 In the GLM with independent $N(0, \sigma^2)$ errors, let $SSE_F$ denote
the residual sum of squares from the full model, and let $SSE_R$ denote the residual sum
of squares when $H_0 : C\beta = b$ is true. Then, when $H_0$ is true

$$F = \frac{\Delta SSE/r}{s^2} \qquad (5.258)$$

has an F-distribution with $(r, n-m-1)$ degrees of freedom.


In order to prove Theorem 5.10 we need to derive the formula for $\Delta SSE$ given
in (5.251). Because of its importance we state this result as a separate lemma.


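Lemma 5.2 If $\hat\beta_F$ and $\hat\beta_R$ denote the least squares estimates of $\beta$ in the full and
reduced models respectively, then

(i) $\hat\beta_R = \hat\beta_F - (X^TX)^{-1}C^TA(C\hat\beta_F - b)$ where

$$A = [C(X^TX)^{-1}C^T]^{-1} \qquad (5.259)$$

(ii) $\Delta SSE = SSE_R - SSE_F = (C\hat\beta_F - b,\; A(C\hat\beta_F - b))$.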


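Proof. (i) From Theorem 5.1, $\hat\beta_F = (X^TX)^{-1}X^Ty$, so it remains to find an
expression for $\hat\beta_R$. We will do this by using the method of Lagrange multipliers. In
this case, we want to minimize $g(\beta) = (y - X\beta, y - X\beta)$ subject to the $r$ constraints
$C\beta = b$.

Letting $\lambda = (\lambda_1, \lambda_2, \ldots, \lambda_r)$ denote a vector of Lagrange multipliers, $\hat\beta_R$ can be
found by minimizing the function (Lagrangean)

$$L = g(\beta) - 2(\lambda, C\beta - b) \qquad (5.260)$$

with respect to $(\beta, \lambda)$.
To do this we calculate

$$\frac{\partial L}{\partial\beta} = \left(\frac{\partial L}{\partial\beta_0}, \frac{\partial L}{\partial\beta_1}, \ldots, \frac{\partial L}{\partial\beta_m}\right)^T \qquad (5.261)$$

and

$$\frac{\partial L}{\partial\lambda} = \left(\frac{\partial L}{\partial\lambda_1}, \frac{\partial L}{\partial\lambda_2}, \ldots, \frac{\partial L}{\partial\lambda_r}\right)^T \qquad (5.262)$$

and set the resulting partial derivatives to zero.
Now, using (5.260),

$$\begin{aligned} L &= (\beta, X^TX\beta) - 2(\beta, X^Ty) + (y, y) - 2(\lambda, C\beta) + 2(\lambda, b) \\ &= (\beta, X^TX\beta) - 2(\beta, X^Ty) + (y, y) - 2(C^T\lambda, \beta) + 2(\lambda, b) \end{aligned} \qquad (5.263)$$

so that

$$\frac{\partial L}{\partial\beta} = 2(X^TX)\beta - 2X^Ty - 2C^T\lambda = 0, \qquad (5.264)$$

and

$$\frac{\partial L}{\partial\lambda} = -2(C\beta - b) = 0, \quad \text{i.e.,} \quad C\beta - b = 0. \qquad (5.265)$$

Denoting the solution to (5.264) by $\hat\beta_R$ we find that

$$\hat\beta_R = (X^TX)^{-1}X^Ty + (X^TX)^{-1}C^T\lambda = \hat\beta_F + (X^TX)^{-1}C^T\lambda. \qquad (5.266)$$

Thus,

$$C\hat\beta_R - b = C\hat\beta_F - b + C(X^TX)^{-1}C^T\lambda \qquad (5.267)$$

and using $C\hat\beta_R - b = 0$ to solve for $\lambda$ gives

$$\lambda = -[C(X^TX)^{-1}C^T]^{-1}(C\hat\beta_F - b). \qquad (5.268)$$

Substituting this expression for $\lambda$ into (5.266) and using the definition of $A$ yields

$$\hat\beta_R = \hat\beta_F - (X^TX)^{-1}C^TA(C\hat\beta_F - b) \qquad (5.269)$$

as required.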
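(ii) For the proof of (ii) we note from (5.269) that for any $\beta$

$$g(\beta) - g(\hat\beta_F) = (X(\beta - \hat\beta_F),\; X(\beta - \hat\beta_F)). \qquad (5.270)$$

Letting $\beta = \hat\beta_R$ in (5.270) shows that (recall $g(\beta)$ from (5.20))

$$\begin{aligned} \Delta SSE &= g(\hat\beta_R) - g(\hat\beta_F) \\ &= (X(\hat\beta_R - \hat\beta_F),\; X(\hat\beta_R - \hat\beta_F)) \\ &= (X(X^TX)^{-1}C^TA(C\hat\beta_F - b),\; X(X^TX)^{-1}C^TA(C\hat\beta_F - b)) \\ &= (C\hat\beta_F - b,\; A^TC(X^TX)^{-1}X^TX(X^TX)^{-1}C^TA(C\hat\beta_F - b)) \\ &= (C\hat\beta_F - b,\; A^TC(X^TX)^{-1}C^TA(C\hat\beta_F - b)). \end{aligned} \qquad (5.271)$$

Now $A = A^T$ (verify this) so that

$$A^TC(X^TX)^{-1}C^TA = AA^{-1}A = A \qquad (5.272)$$

which gives

$$\Delta SSE = (C\hat\beta_F - b,\; A(C\hat\beta_F - b)) \qquad (5.273)$$

as required. ∎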
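
Part (ii) of the lemma can be verified numerically in the simplest setting. The sketch below uses simple linear regression with made-up data and the hypothesis $\beta_1 = 0$, for which $C = [0, 1]$, $b = 0$ and $C(X^TX)^{-1}C^T = 1/S_{xx}$, so that $A = S_{xx}$. The data values are hypothetical and serve only to illustrate the identity.

```python
# Simple linear regression y = b0 + b1*x + e with hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Full model: unconstrained least squares fit.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
sse_full = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))

# Reduced model under H0: beta_1 = 0, whose fitted value is y_bar.
sse_reduced = sum((yi - y_bar) ** 2 for yi in y)

# For C = [0, 1] the diagonal element of (X'X)^{-1} belonging to beta_1
# is 1/Sxx, so A = Sxx and the quadratic form in (ii) is b1 * Sxx * b1.
quadratic_form = b1 * sxx * b1

print(sse_reduced - sse_full, quadratic_form)
```

The two printed numbers agree up to floating-point error, and both equal the regression sum of squares.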


Proof of Theorem 5.10. We begin by showing that $\Delta SSE/\sigma^2$ is $\chi^2(r)$ if $H_0$
is true. If this is the case, then $Y \sim N(X\beta, \sigma^2I_n)$ and $\hat\beta_F = (X^TX)^{-1}X^TY$ so that
$\hat\phi = C\hat\beta_F - b$ is $N(E(\hat\phi), \Sigma(\hat\phi))$.
But,

$$\begin{aligned} E(\hat\phi) &= E(C\hat\beta_F - b) = E[C(X^TX)^{-1}X^TY] - b \\ &= C(X^TX)^{-1}X^TE(Y) - b = C(X^TX)^{-1}X^TX\beta - b \\ &= C\beta - b = 0. \end{aligned} \qquad (5.274)$$

Also,

$$\Sigma(\hat\phi) = C\Sigma(\hat\beta_F)C^T = \sigma^2C(X^TX)^{-1}C^T \qquad (5.275)$$

so that

$$C\hat\beta_F - b \sim N(0, \sigma^2C(X^TX)^{-1}C^T). \qquad (5.276)$$

Since $A = [C(X^TX)^{-1}C^T]^{-1}$, $\hat\phi = C\hat\beta_F - b \sim N(0, \sigma^2A^{-1})$ if $H_0$ is true. Now using
the result in (iv) of Theorem 5.5 we find that

$$\frac{\Delta SSE}{\sigma^2} = \frac{(\hat\phi, A\hat\phi)}{\sigma^2} \qquad (5.277)$$

has a $\chi^2(r)$ distribution.
Since $\Delta SSE$ is a function only of $\hat\beta_F$, it follows again from (i) of Theorem 5.5 that
$\Delta SSE$ and $s^2$ are independent. Thus,

$$F = \frac{\Delta SSE/r}{s^2} \qquad (5.278)$$

has an F-distribution with $(r, n-m-1)$ degrees of freedom. ∎

Using the basic decomposition $SST = SSE + SSR$ we also find that

$$\Delta SSE = SSR_F - SSR_R \qquad (5.279)$$

so that $\Delta SSE$ is also the increase in the regression sum of squares due to the failure of
$H_0$. Thus,

$$SSR_F = SSR_R + \Delta SSE. \qquad (5.280)$$

In this case $\Delta SSE$ is often referred to as the extra sum of squares contributed to the
regression sum of squares due to the failure of $H_0$. For instance, if $\beta_R = (\beta_0, \beta_1, \ldots, \beta_{m-1}, 0)^T$
(i.e., $\beta_m = 0$) and the true model has $\beta = \beta_F$, then $\Delta SSE$ is the extra regression sum
of squares due to the addition of $x_m$ to the model. For this reason the F-test given by
(5.278) is sometimes called the extra sum of squares principle. We now present a couple
of numerical examples illustrating its use.


Example 5.29 As we have seen in Example 5.12 (housing data), the test for the
overall regression is very significant. Now, we consider the question whether there is a
significant increase in the variation explained by the additional term $x_3 = (\text{Age})^2 = x_2^2$.
The ANOVA table can be made by the decomposition of the sum of squares as follows:


Table 5.28 ANOVA for Housing Data

Source               df    Sum of Squares    Mean Squares      F
Regression           3     16,873,260,084    5,624,420,028     102.60
  $x_1$              (1)   15,398,166,924    15,398,166,924    (280.88)
  $x_2 | x_1$        (1)   1,354,937,318     1,354,937,318     (24.71)
  $x_3 | x_1, x_2$   (1)   120,155,842       120,155,842       (2.19)
Residual             11    603,029,249       54,820,841
Total                14    17,476,289,333

$R^2 = 0.965$, $\bar{R}^2 = 0.956$


To test the null hypothesis $H_0$: the addition of $x_3$ (or $\beta_3$) to the model does not
significantly improve the prediction of $Y$, we must calculate the extra sum of squares due
to the addition of $\beta_3x_3$ to the model. This sum of squares can be calculated simply from
the formula given in Eq. (5.279) and is also given in Table 5.28,

$$\Delta SSE = SSR_F - SSR_R = 120{,}155{,}842.$$

Therefore, we compute the partial F-statistic as

$$F(x_3 \mid x_1, x_2) = \frac{SS(x_3 \mid x_1, x_2)}{MS\,\text{Residual}(x_1, x_2, x_3)} = \frac{120{,}155{,}842}{54{,}820{,}841} = 2.19,$$

and this F-statistic has an F-distribution with 1 and 11 degrees of freedom under $H_0$,
so we would not reject $H_0$ at $\alpha = 10\%$.
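
The bracketed F values in Table 5.28 are each a sequential sum of squares divided by the residual mean square of the full model. The following sketch recomputes them from the table entries; the dictionary keys are illustrative labels, not notation from the text.

```python
# Sequential (extra) sums of squares from Table 5.28, each with 1 degree of freedom.
sequential_ss = {
    "x1": 15_398_166_924,
    "x2|x1": 1_354_937_318,
    "x3|x1,x2": 120_155_842,
}
ss_residual = 603_029_249
df_residual = 11

# Residual mean square of the full model (x1, x2, x3).
ms_residual = ss_residual / df_residual

# Partial F statistic for each term added in sequence.
partial_f = {term: ss / ms_residual for term, ss in sequential_ss.items()}

# The sequential sums of squares add up to the regression sum of squares.
ss_regression = sum(sequential_ss.values())
r2 = ss_regression / (ss_regression + ss_residual)

for term, f in partial_f.items():
    print(f"F({term}) = {f:.2f}")
print(f"SSR = {ss_regression:,}, R^2 = {r2:.3f}")
```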


Example 5.30 (Joint Confidence regions) An interesting consequence of the
previous calculations is a method for obtaining joint confidence regions for the regression
coefficients $(\beta_0, \beta_1, \ldots, \beta_m)^T$. From (5.276), letting $C = I_{m+1}$, we see that for the quadratic
form

$$Q = (\hat\beta - \beta,\; (X^TX)(\hat\beta - \beta)), \qquad (5.281)$$

$Q/\sigma^2$ has a chi-square distribution with $m + 1$ degrees of freedom. Hence,

$$F = \frac{Q/(m+1)}{s^2} = \frac{Q/(m+1)}{SSE/(n-m-1)} \qquad (5.282)$$

has an F-distribution with $(m+1, n-m-1)$ degrees of freedom.
Hence,

$$P[F \le f_\alpha(m+1, n-m-1)] = 1 - \alpha \qquad (5.283)$$

so that the region $\{F \le f_\alpha(m+1, n-m-1)\}$ is a $(1-\alpha)100\%$ joint confidence region for
$(\beta_0, \beta_1, \ldots, \beta_m)$.


Specializing to the case $m = 1$ gives the joint confidence region in (3.121) for the
parameters $(\beta_0, \beta_1)$ in the SLR model, which is an elliptically shaped region. We leave
the details to the reader.
5.9 Prediction


5.9.1 Predicting $E(Y_x)$


Once we have fitted the model (5.11) to our data and have decided that the fit is adequate,
we can then use the model to make point and interval estimates of both $E(Y_x)$ and $Y_x$
as in the case of simple linear regression. We begin with the estimation of $E(Y_x)$.

If $x_0 = (1, x_{0,1}, x_{0,2}, \ldots, x_{0,m})$ is a point in the domain of the independent variable $x$,
then a point estimate of $E(Y_{x_0})$ is given by

$$\hat\mu_{x_0} = \hat\beta_0 + \sum_{j=1}^{m} x_{0,j}\hat\beta_j = x_0\hat\beta \qquad (5.284)$$

(regarding $x_0$ as a row vector and as usual, $\hat\beta$ as a column vector).
From (5.284) we find that

$$\mathrm{Var}(\hat\mu_{x_0}) = \sigma^2(x_0, (X^TX)^{-1}x_0) = \sigma^2[x_0(X^TX)^{-1}x_0^T]. \qquad (5.285)$$

And this can be estimated by

$$\hat\sigma^2(\hat\mu_{x_0}) = s^2[x_0(X^TX)^{-1}x_0^T] \qquad (5.286)$$

where $\hat\sigma^2(\hat\mu_{x_0})$ is called the estimated prediction variance. In particular, if $x_0 = x_i =
(1, x_{i1}, x_{i2}, \ldots, x_{im})$, the $i$-th design point, then

$$\hat\sigma^2(\hat\mu_{x_i}) = s^2x_i(X^TX)^{-1}x_i^T = s^2h_{ii} \qquad (5.287)$$

where $h_{ii}$ is the $i$-th diagonal element of the hat matrix $X(X^TX)^{-1}X^T$. (Prove this.)

Under the standard normality assumption of the GLM we can use the above results
to obtain confidence intervals for $E(Y_{x_0}) = \mu_{x_0}$. In fact, since $\hat\mu_{x_0}$ is a linear combination
of joint multivariate normal random variables, it is normal with

$$E(\hat\mu_{x_0}) = \mu_{x_0} \qquad (5.288)$$

and $\mathrm{Var}(\hat\mu_{x_0})$ given by (5.285). Thus

$$\frac{\hat\mu_{x_0} - \mu_{x_0}}{\sigma\sqrt{x_0(X^TX)^{-1}x_0^T}} \qquad (5.289)$$

is $N(0, 1)$ and by the independence of $\hat\beta$ and $s$

$$\frac{\hat\mu_{x_0} - \mu_{x_0}}{s\sqrt{x_0(X^TX)^{-1}x_0^T}} \qquad (5.290)$$

has a t-distribution with $n-m-1$ degrees of freedom. From this it follows by standard
manipulations that a $(1-\alpha) \times 100\%$ confidence interval for $\mu_{x_0}$ is given by

$$\left(\hat\mu_{x_0} \pm t_{n-m-1,\,\alpha/2}\, s\sqrt{x_0(X^TX)^{-1}x_0^T}\right). \qquad (5.291)$$
Example 5.31 (Housing data) Suppose that a realtor wishes to construct a 95%
confidence interval for the mean price of a house with $x_1 = 2{,}400$ square feet and the age of
the house $x_2 = 5$ years. Let $x_0 = (1, 2400, 5)$ and from the previous results

$$\hat\beta = \begin{pmatrix} 13239 \\ 60.589 \\ -1726.8 \end{pmatrix}, \quad s = 7763.$$

Using (5.284) the point estimate of the price is $\hat\mu_{x_0} = \$150{,}018.6$. Then, we calculate

$$X^TX = \begin{pmatrix} 15 & 31050 & 96 \\ 31050 & 69022500 & 208750 \\ 96 & 208750 & 1090 \end{pmatrix},$$

$$(X^TX)^{-1} = \begin{pmatrix} 0.977993 & -0.000426 & -0.004463 \\ -0.000426 & 0.000000 & -0.000005 \\ -0.004463 & -0.000005 & 0.002201 \end{pmatrix}.$$

Hence

$$x_0(X^TX)^{-1}x_0^T = 0.0992746.$$

Since $n = 15$ and $m = 2$, we have $n - m - 1 = 12$ and $t_{12,\,0.025} = 2.179$ from Table A.2. Therefore, from (5.291) the
95% confidence interval for the mean predicted value at $x_1 = 2{,}400$ and $x_2 = 5$ is given
by

$$150{,}018.6 \pm 2.179\,(7763)\sqrt{0.09927} = (\$144{,}688.86,\ \$155{,}348.34).$$
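
The arithmetic of this example can be reproduced directly from the reported quantities. The sketch below takes the quadratic form $x_0(X^TX)^{-1}x_0^T$ and the critical value as given in the example rather than recomputing them; the function name `mean_response_ci` is hypothetical.

```python
import math


def mean_response_ci(beta_hat, x0, s, leverage, t_crit):
    """Interval (5.291): x0*beta_hat -/+ t * s * sqrt(x0 (X'X)^{-1} x0')."""
    estimate = sum(b * x for b, x in zip(beta_hat, x0))
    half_width = t_crit * s * math.sqrt(leverage)
    return estimate, estimate - half_width, estimate + half_width


estimate, lower, upper = mean_response_ci(
    beta_hat=[13239, 60.589, -1726.8],
    x0=[1, 2400, 5],
    s=7763,
    leverage=0.0992746,   # x0 (X'X)^{-1} x0' as reported in the example
    t_crit=2.179,         # t_{12, 0.025} as quoted from Table A.2
)
print(f"estimate = {estimate:.1f}, interval = ({lower:.2f}, {upper:.2f})")
```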


5.9.2 Prediction Intervals 


If we wish to predict the value of a new observation at $x_0$, then again the point estimate
$\hat{Y}_{x_0} = x_0\hat\beta$ is used. If $Y_{x_0}$ is the true value of $Y_x$ at $x = x_0$, then $Y_{x_0}$ and $\hat{Y}_{x_0}$ will be independent
provided that the observations $Y_{x_0}, Y_1, Y_2, \ldots, Y_n$ are independent, which we assume to
be the case. Then,

$$\mathrm{Var}(Y_{x_0} - \hat{Y}_{x_0}) = \mathrm{Var}(Y_{x_0}) + \mathrm{Var}(\hat{Y}_{x_0}) = \sigma^2[1 + x_0(X^TX)^{-1}x_0^T] \qquad (5.292)$$

so that

$$\frac{Y_{x_0} - \hat{Y}_{x_0}}{\sigma\sqrt{1 + x_0(X^TX)^{-1}x_0^T}} \qquad (5.293)$$

is a $N(0, 1)$ random variable under the standard normality assumptions of the GLM.
From this it follows easily that

$$\frac{Y_{x_0} - \hat{Y}_{x_0}}{s\sqrt{1 + x_0(X^TX)^{-1}x_0^T}} \qquad (5.294)$$

has a t distribution with $n-m-1$ degrees of freedom. Using standard manipulations
we find that a $(1-\alpha) \times 100\%$ prediction interval for $Y_{x_0}$ is given by

$$\left(\hat{Y}_{x_0} \pm t_{n-m-1,\,\alpha/2}\, s\sqrt{1 + x_0(X^TX)^{-1}x_0^T}\right). \qquad (5.295)$$
Example 5.32 (Drink delivery data) Suppose that a deliveryman wishes to
construct a 90% prediction interval for the delivery time when he has $x_1 = 12$ cases and
a distance of $x_2 = 400$ feet. First, in order to find the point estimate of the delivery time,
let $x_0 = (1, 12, 400)$. From the previous results, we have

$$\hat\beta = \begin{pmatrix} 2.341 \\ 1.6159 \\ 0.014385 \end{pmatrix}, \quad s = 3.2594$$

so the point estimate is $\hat{Y}_{x_0} = 27.49$ minutes. Further calculations give

$$X^TX = \begin{pmatrix} 25 & 219 & 10{,}232 \\ 219 & 3{,}055 & 133{,}899 \\ 10{,}232 & 133{,}899 & 6{,}725{,}688 \end{pmatrix},$$

$$(X^TX)^{-1} = \begin{pmatrix} 0.1132152 & -0.0044486 & -0.00008367 \\ -0.0044486 & 0.0027438 & -0.00004786 \\ -0.00008367 & -0.00004786 & 0.00000123 \end{pmatrix},$$

hence we obtain

$$x_0(X^TX)^{-1}x_0^T = 0.0717868.$$

Therefore, from (5.295) with $t_{22,\,0.05} = 1.717$, the 90% prediction interval on the delivery
time for $x_1 = 12$ cases and $x_2 = 400$ feet is given by

$$27.49 \pm 1.717\,(3.2594)\sqrt{1 + 0.07179} = (21.70,\ 33.28).$$
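
As with the confidence interval, the prediction interval follows from a few reported numbers. The sketch below uses the quadratic form and critical value quoted in the example; the function name `prediction_interval` is hypothetical. Because the point estimate is computed here without intermediate rounding, the end points can differ from the text in the second decimal.

```python
import math


def prediction_interval(estimate, s, leverage, t_crit):
    """Interval (5.295): estimate -/+ t * s * sqrt(1 + x0 (X'X)^{-1} x0')."""
    half_width = t_crit * s * math.sqrt(1.0 + leverage)
    return estimate - half_width, estimate + half_width


beta_hat = [2.341, 1.6159, 0.014385]
x0 = [1, 12, 400]

# Point estimate x0 * beta_hat (about 27.49 minutes).
estimate = sum(b * x for b, x in zip(beta_hat, x0))

lower, upper = prediction_interval(
    estimate,
    s=3.2594,
    leverage=0.0717868,   # x0 (X'X)^{-1} x0' as reported in the example
    t_crit=1.717,         # t_{22, 0.05} as quoted in the example
)
print(f"estimate = {estimate:.2f}, interval = ({lower:.2f}, {upper:.2f})")
```

The extra 1 under the square root is what makes a prediction interval wider than the confidence interval for the mean at the same point.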


5.9.3 Extrapolation 


In the case of simple linear regression we showed that the quality of prediction depended
essentially on the distance of the data point from the mean of the observation points $\bar{x}$.
Because of this we urged readers to be cautious in using the fitted model to make
predictions outside of the interval $[x_{\min}, x_{\max}]$ where the data were taken. These remarks
apply to the case of multiple regression as well, but with some additional problems.

First, it is not immediately clear what the region for extrapolation is. Since the
prediction variance is proportional to $x_0(X^TX)^{-1}x_0^T$, points with “large” values of this
quantity should be avoided since prediction there will generally be unreliable. In particular,
if $x_i$ is an observed data point, then one can expect that those points for which
$h_{ii} = x_i(X^TX)^{-1}x_i^T$ is largest would lie on the boundary of the set where prediction is
reliable. This suggests that points that lie inside the ellipsoid

$$x(X^TX)^{-1}x^T \le \max_{1 \le i \le n} h_{ii} \qquad (5.296)$$

may be considered as acceptable places to make predictions, while those outside this
region are considered as unacceptable. For example, Figure 5.8 illustrates two points
$(x_{11}, x_{12})$ and $(x_{21}, x_{22})$ that lie within the range of both regressors $x_1$ and $x_2$ but outside the
joint region of the original data.

In particular, one needs to be careful in using as prediction points those points whose
coordinates are smaller in absolute value than the largest of the absolute values of the
components in the data vectors $x_i$, since such points will generally not lie in the region
determined by (5.296). This problem is sometimes called hidden extrapolation, and one
should generally check to see whether (5.296) is satisfied before using the model to predict
at a new point other than the data points $x_i$, $1 \le i \le n$.


Figure 5.8: An example of extrapolation


Some writers have proposed a more stringent definition of the acceptable prediction
region [87]. They take this to be the smallest convex set containing $x_1, x_2, \ldots, x_n$ and call
it the independent variable hull (IVH). (In the case of simple linear regression this is just
the interval $[x_{\min}, x_{\max}]$.) Since this set may be quite difficult to determine explicitly,
the ellipsoid (5.296) provides a computable compromise since it contains the IVH and
the data points.
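
The check in (5.296) is easy to automate. The sketch below is one possible implementation in plain Python: it evaluates $x(X^TX)^{-1}x^T$ by solving a small linear system, then compares a candidate point with the largest leverage $h_{ii}$ among the design points. The design matrix, the two candidate points and the helper names are all made up for illustration; the regressors are chosen to be strongly correlated so that a point inside both marginal ranges can still fall outside the ellipsoid.

```python
def solve(a, b):
    """Solve a z = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    aug = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def leverage(X, x0):
    """Return x0 (X'X)^{-1} x0' for a design matrix X given as a list of rows."""
    p = len(x0)
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    z = solve(xtx, x0)
    return sum(u * v for u, v in zip(x0, z))


# Hypothetical design: rows are (1, x1, x2) with x1 and x2 strongly correlated.
X = [[1, 1, 1], [1, 2, 2.5], [1, 3, 2.5], [1, 4, 4.5], [1, 5, 4.5], [1, 6, 6]]
h = [leverage(X, row) for row in X]
h_max = max(h)

inside = [1, 3.5, 3.5]   # near the centre of the data
hidden = [1, 1.5, 5.5]   # within both ranges [1, 6], but far from the data cloud

h_inside = leverage(X, inside)
h_hidden = leverage(X, hidden)
print(f"max h_ii = {h_max:.3f}, inside = {h_inside:.3f}, hidden = {h_hidden:.3f}")
```

Here the second candidate has each coordinate within the observed range, yet its value of the quadratic form far exceeds the largest $h_{ii}$, which is the hidden extrapolation described above.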


5.10 Exercises 


5.1 Write out the normal equations explicitly for $m = 2$ in (5.1), and find the least
squares estimators $\hat\beta_0$, $\hat\beta_1$ and $\hat\beta_2$.


5.2 Consider a multiple linear regression model with $m > 1$ in (5.1). Under the
assumption that the errors are i.i.d. $N(0, \sigma^2)$ for a random sample of size $n$, show
that the MLE of $\sigma^2$ is given by

$$\hat\sigma^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2.$$


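5.3 Let $A = I_n - X(X^TX)^{-1}X^T$ and $B = (X^TX)^{-1}X^T$. Show that $A^T = A$, $A^2 = A$
and $BA = 0$.

5.4 (Vector differentiation). If $\partial/\partial\beta = (\partial/\partial\beta_i)$ denotes the vector of partial derivatives, show that
(a) $\partial(\beta^Ta)/\partial\beta = a$.
(b) $\partial(\beta^TA\beta)/\partial\beta = 2A\beta$, where $A$ is symmetric.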
5.5 By differentiating $g(\beta) = (y - X\beta, y - X\beta)$ with respect to $\beta_j$, $j = 0, 1, 2, \ldots, m$,
show that the minimizing value $\hat\beta$ satisfies the normal equations $(X^TX)\hat\beta = X^Ty$.


5.6 Consider the sample model corresponding to (5.1) as

$$y_i = \beta_0 + \sum_{j=1}^{m}\beta_jx_{ij} + \varepsilon_i, \quad i = 1, 2, \ldots, n.$$

(a) Find the least squares function $g(\beta) = g(\beta_0, \beta_1, \ldots, \beta_m) = \sum_{i=1}^{n}\varepsilon_i^2$.


(b) Differentiate the least squares function found in (a) with respect to $\beta_0$ and
$\beta_j$ ($j = 1, 2, \ldots, m$) respectively. (Then set them equal to zero.)


(c) Write down the least squares normal equations by simplifying what you found 
in (b). 

5.7 Consider a multiple linear regression model $Y = X\beta + \varepsilon$. Prove that $X^T\hat\varepsilon = 0$.


5.8 For any linear model, show that 


1€ х 1 " 2 
-Y Var (£) Кы |X(X^x) p gi Der. 
TL icl n n 


5.9 Consider the hat matrix $H = X(X^TX)^{-1}X^T$. Show that $SSR = Y^THY =
\hat{Y}^T\hat{Y} = Y^TH^2Y$ provided $\bar{Y} = 0$.


5.10 Consider a multiple linear regression model in (5.1). Let $1^T = [1, 1, \ldots, 1]_{1 \times n}$ and
note that $X^T1$ is the first column of the matrix $X^TX$. Show that
(a) $(X^TX)^{-1}X^T1 = (1, 0, \ldots, 0)^T$;

(b) $1^TX(X^TX)^{-1}X^T1 = n$.

[Ref: The American Statistician, pp. 47-48 April 1972]


5.11 Suppose that a set of data is given by

  y    x1    x2
  8     4     2
  1     9    -8
  0    11   -10
  5     3     6
  3     8    -6
  2     5     0
 -4    10   -12
 11     3     5
 -3     7    -2
  5     6    -4

(a) Define the design matrix and calculate the inverse of the $X^TX$ matrix.

(b) Using the postulated model for the data, estimate $\beta_j$, $j = 1, 2$ in the model:

$$Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \varepsilon.$$

(c) Construct the ANOVA table.

(d) Test to determine if the overall regression is statistically significant. Take $\alpha = 5\%$.

(e) Calculate the variances of $\hat\beta_1$ ($= b_1$), $\hat\beta_2$ ($= b_2$), and the variance of the predicted
value of $Y$ for the point $x_1 = 3$, $x_2 = 5$.

(f) Calculate $R^2$ and give a comment.


5.12 Suppose a sample of size n — 17 (including repeated runs) was obtained from an 


experimental study: 
Response (y) 

64, 71 

68, 73 

60, 60, 77, 78 
12,80 

78, 61, 79, 64 
87 

70 

96 


( 
( 
( 
( 


5.13 Let the response Y be a function of three independent variables z1, х2, and z3. 
Suppose that n = 7 data points are shown in the table below. 


Obs. No. 21 
—1 
—2 
0 
—3 
1 
3 
2 


NO oR WN н 


(a) Fit these data to the model: Y = 50 + 6,21 + 8522 + 8323 + E. 
(b) Find the predicted value, ў when ту = 1, 22 = —3, тз = —1. What is ў — y? 


(c) Test Ho : 83 = 0. Take a = 0.05. 


(d) Construct a 95% confidence interval for the mean response of Y, given xı = 1, 


ло = —3, T3 = —1. 


(e) Find a 95% prediction interval for Y, given z; = 1, x2 = —3, z3 = —1. Compare 


this to the anwer in (d). 


a) Fit the model: Y = Bo + 6,21 + 8522 + €. 
b) Construct the ANOVA and perform the test for lack-of-fit. 


c) Calculate the residuals and plot them. Also give a comment. 


T2 
—3 
0 
—4 
5 
—3 
5 
0 


T3 
1 
1 
0 
—1 
—1 
1 
—1 


T2 
71 
43 
27 
32 
90 
40 

7 
96 


d) Assess the postulated regression model using F, R?. 


со со Юю к н о Oe 
5.14 Find the MLEs of the regression coefficients for a multiple regression model
$Y = \beta_0 + \sum_{j=1}^{m}\beta_jx_j + \varepsilon$, $m = 2$. Assume that the $\varepsilon_i$ are independent $N(0, \sigma^2)$.


5.15 Show that the fitted plane $\hat{y} = \hat\beta_0 + \sum_{j=1}^{m}\hat\beta_jx_j$ passes through the point
$(\bar{y}, \bar{x}_1, \bar{x}_2, \ldots, \bar{x}_m)$.
5.16 Let Y1, Ү,..., Yn be observations, which can be postulated by the model 
Y; = Ва? Ben pei 


where the $x_i$'s are fixed constants and the $\varepsilon_i$'s are i.i.d. $N(0, \sigma^2)$.
(a) Find the least squares estimator of $\beta$.
(b) Find the MLE of $\beta$.


5.17 A researcher is interested in comparing the growth rates for bacteria
types A and B. The rates of growth were recorded at each of five points of a time
period.
Time 
Bacteria Type 1 2 3 4 5 
А 80 90 391 102 10.4 
B 10.0 10.3 12.2 12.6 13.9 


(a) Using a dummy variable ($x_1$) for bacteria type, fit the model:
$Y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_3x_1x_2 + \varepsilon$.


(b) Plot the data points and graph the growth lines for bacteria types A and B
respectively. Note that $\beta_3$ is the difference between the slopes of the two lines and
represents the interaction between time and bacterial type.


(c) Do the data indicate sufficient evidence that there exists a difference in the
rates of growth for the two types of bacteria?


(d) Find a 95% confidence interval for the mean growth rate for bacteria type B at
time $x_2 = 4$.

(e) Find a 95% prediction interval for the growth rate $Y$ of bacteria type B at time
$x_2 = 4$.


5.18 In an experiment, a linear model Y; = 8, + D + gi, i = 1,2,...,8, 18 
considered. From the data set we have 


ПЕ ox 5] 9 

E balal 8 

ї wb dp eub 3 Bo 

lc Xo ss 7 Ад, 
ИИ ОЕ 8, 

1. dq oed d 1 b3 

óc. do d 8 

1 1 1 1 6 


(a) Calculate XTX, (XTX), XTY. 
(b) Find the estimated vector $\hat\beta$.
(c) Construct the ANOVA table.
(d) Test Но: 8, = 85 = 0. Take a = 0.10. 
(e) Compute the Д? for the reduced model Y = 8, + 842172 + €. 
5.19 The cloud point of a liquid is a measure of the degree of crystallization in a 
stock that can be measured by the refractive index. It has been suggested that the 


percentage of I-8 in the base stock is an excellent predictor of cloud point. The 
following data were collected on stocks with a known percentage of 1-8 [27]. 


961-8 Cloud Point, Y | 96 I-8 Cloud Point, Y 
22.1 2 26.1 


© 


ою 00-10 сл + wD н 


(a) Fit the data using the second-order model: Y = 8, + 8х + 8,11? + e. 

(b) Is the overall regression model in (a) significant? Take a = 0.05. 

(c) Construct the ANOVA table and perform the test of lack of fit for the model 
in (a). 

(d) Fit the data using the first-order model: Y = 6) + 6,2 + e. 


(e) Use the residuals from the result in (d), make a conclusion whether the first- 
order model would have been sufficient. 


5.20 Suppose we have a parameter vector В = [@,, Bo, Bs, Ag to be estimated in a 
linear model. Suppose one wishes to test the following hypotheses. Specify the 
matrices С and b in the linear hypothesis of the form Но: СВ = b. 


(a) Ho: 84 = b3, b2 = 84. 
(b) Ho : 81 = 83 = 0,85 = B4 = 0. 
(c) Но: 8, = b2 = 83 = B4 = 0. 

5.21 Ге Yi = +B, +i; (i zl2sdpe ld зы J) , where 255 4:8; = 0 (25; di € 0) 
and E (=:;) = 0 for all i, j. Using the method of Lagrange multipliers find the least 
squares estimates of и and 8, [104]. 


[Hint: Show that the Lagrange multiplier is zero.] 
Chapter 6


Residuals, Diagnostics and 
Transformations 


6.1 Introduction 


As for simple linear regression, it is important to check the validity of a proposed model. 
Of course, the statistics such as R?, t and F tests are important methods for doing 
this. However, these tests were all derived under the assumption that (5.1) was the true 
model and the assumptions about normality and constant variance of errors were valid. 
In general, as we noted in Chapter 3, these assumptions can usually only be tested after 
the model has been fitted, and as was done there, the examination of residuals is an 
important tool for doing this. 

In addition, it is important to examine the data itself for effects on the fit, because 
even if the model is correct, the available sample of observations may make it difficult to 
obtain good estimates of the parameters. As we have shown in Examples 5.24 and 5.27 
strong multicollinearity can confound the parameter estimates, even if the overall model 
fits the data well. In addition, it is important to examine the effect of the observations 
on the fit, since particular cases may unduly affect parameter estimates. Traditionally, 
this was done by looking for outliers in the fitted values $\hat{y}$. Clearly, a large residual at
a given observation $y_i$ suggests that there may be a problem with the model. However,
the opposite can happen as well: a discrepant point may be masked by the fact that it
is highly influential on the fit. Over the past 20 years this subject has gained increasing
attention among statisticians and such examination should become a routine part of 
modern regression analysis. 

In this Chapter, we will develop a number of tools, generalizing those we discussed in 
Chapter 3 and a number of new regression diagnostics to look for influential data points. 
Finally, if these procedures discover apparent model violations, we discuss a variety 
of techniques for variable transformations and methods for correction of non-constant 
variance. 
6.2 Residuals


In this section we establish a number of properties of the residuals from the least squares 
fit of (5.1). We begin by summarizing some of those established in Chapter 3 and Chapter 
5 and discuss several new ones which are important in justifying the various residual plots 
discussed in Section 6.3. 


6.2.1 Properties of $\hat{\varepsilon}$

Recall from Chapter 5 that the fitted values $\hat{y}$ are given by

$$\hat{y} = X\hat{\beta} = X(X^TX)^{-1}X^Ty = Hy \tag{6.1}$$

where, as before, $H$ is called the “hat matrix”, so called because it transforms $y$ into $\hat{y}$
($y$-hat). Using (6.1) the residuals $\hat{\varepsilon}$ are defined by

$$\hat{\varepsilon} = y - \hat{y} = y - Hy = (I - H)y. \tag{6.2}$$


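If (5.1) is true, then $Y = X\beta + \varepsilon$, so that (using $(I - H)X = 0$)

$$\hat{\varepsilon} = (I - H)(X\beta + \varepsilon) = (I - H)X\beta + (I - H)\varepsilon = (I - H)\varepsilon. \tag{6.3}$$

From (6.3) it follows that

$$E(\hat{\varepsilon}) = (I - H)E(\varepsilon) = 0 \tag{6.4}$$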


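and using the fact that $I - H$ is symmetric and idempotent

$$\Sigma(\hat{\varepsilon}) = (I - H)\,\Sigma(\varepsilon)\,(I - H) = \sigma^2(I - H)^2 = \sigma^2(I - H). \tag{6.5}$$

Moreover, if $\varepsilon \sim N(0, \sigma^2 I)$, then it follows from (6.3) that $\hat{\varepsilon} \sim N(0, \sigma^2(I - H))$. From
(6.5) it follows that

$$\operatorname{Var}(\hat{\varepsilon}_i) = \sigma^2(1 - h_{ii}) \tag{6.6}$$

where $h_{ii}$ is the $i$-th diagonal element of $H$, and

$$\operatorname{Cov}(\hat{\varepsilon}_i, \hat{\varepsilon}_j) = -\sigma^2 h_{ij}, \quad i \ne j. \tag{6.8}$$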
As noted in Chapter 3, it is generally preferable to standardize the residuals in some
way. Since $\operatorname{Var}(\hat{\varepsilon}_i) = \sigma^2(1 - h_{ii})$, the quantity $\hat{\varepsilon}_i/(\sigma\sqrt{1 - h_{ii}})$ has variance 1, and if we
estimate $\sigma$ by $s = \sqrt{SSE/(n - m - 1)}$, then

$$r_i = \frac{\hat{\varepsilon}_i}{s\sqrt{1 - h_{ii}}}, \quad 1 \le i \le n \tag{6.9}$$

is called the internally studentized residual. On the other hand, if $\sigma^2$ is estimated from
the least squares fit by omitting the $i$-th observation, then, denoting this estimate of $\sigma$
by $\hat{\sigma}_{(-i)}$,

$$t_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(-i)}\sqrt{1 - h_{ii}}}, \quad 1 \le i \le n \tag{6.10}$$

is called the externally studentized residual. In computer programs $t_i$ is frequently referred
to as RSTUDENT. We leave it as an exercise for the reader to show that these formulas
reduce to those given in Chapter 3 for simple linear regression.
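The definitions (6.1), (6.2), (6.9) and (6.10) can be sketched numerically as follows. This is only an illustration on simulated data: the function name `studentized_residuals`, the simulated design and the use of NumPy are assumptions of the sketch, not part of the text. The externally studentized residuals are computed here by brute force, refitting the model $n$ times.

```python
import numpy as np

def studentized_residuals(X, y):
    """Return (resid, h, r, t) for the least squares fit of y on X.

    X is the n x (m+1) design matrix, including the column of ones.
    """
    n, p = X.shape                           # p = m + 1
    H = X @ np.linalg.solve(X.T @ X, X.T)    # hat matrix, as in (6.1)
    h = np.diag(H)                           # leverages h_ii
    resid = y - H @ y                        # residuals, as in (6.2)
    s = np.sqrt(resid @ resid / (n - p))     # s = sqrt(SSE / (n - m - 1))
    r = resid / (s * np.sqrt(1.0 - h))       # internally studentized, (6.9)
    t = np.empty(n)
    for i in range(n):                       # externally studentized, (6.10)
        keep = np.arange(n) != i
        b_i, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        sse_i = np.sum((y[keep] - X[keep] @ b_i) ** 2)
        sigma_i = np.sqrt(sse_i / (n - 1 - p))   # sigma-hat with case i omitted
        t[i] = resid[i] / (sigma_i * np.sqrt(1.0 - h[i]))
    return resid, h, r, t

# Simulated data (hypothetical, for illustration only)
rng = np.random.default_rng(0)
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
resid, h, r, t = studentized_residuals(X, y)
```

The loop is deliberately naive; the deletion formulas of Section 6.4 show that no refitting is actually needed.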

From (6.9)-(6.10) we see that if $h_{ii}$ is small, then for large $n$, we would expect that
$\hat{\varepsilon}_i/s$, $r_i$ and $t_i$ should all be approximately $N(0, 1)$ and so should behave roughly the same.
For small $n$ (say $n < 20$) and/or $h_{ii}$ close to one, either $r_i$ or $t_i$ is to be preferred
over $\hat{\varepsilon}_i$, and current statistical practice seems to favor using $t_i$, particularly if the leverage
$h_{ii}$ of $y_i$ is large.

In fact, as we will show, $h_{ii}$ plays a major role in many of the currently used regression
diagnostics. For future reference we now discuss some of its basic properties.


6.2.2 The Leverage $h_{ii}$

Since $\operatorname{Var}(\hat{\varepsilon}_i) \ge 0$, it follows from (6.6) that $h_{ii} \le 1$. In addition, since $H$ is
symmetric and $H^2 = H$,

$$h_{ii} = \sum_{k=1}^{n} h_{ik}h_{ki} = \sum_{k=1}^{n} (h_{ik})^2. \tag{6.11}$$

Thus, $h_{ii} \ge 0$. (One can actually show the stronger inequality $h_{ii} \ge 1/n$ [20].)

The importance of $h_{ii}$ comes from the following observation. Using the fact that
$\hat{y} = Hy$,

$$\hat{y}_i = \sum_{j=1}^{n} h_{ij}y_j = h_{ii}y_i + \sum_{j \ne i} h_{ij}y_j. \tag{6.12}$$

Now if the $h_{ij}$, $i \ne j$, are “small”, then it follows from (6.12) that $\hat{y}_i \approx h_{ii}y_i$, so that $h_{ii}$
measures the effect that $y_i$ has on determining the fitted value $\hat{y}_i$. If $h_{ii} \approx 1$, then $\hat{y}_i \approx y_i$
and the fit is forced through the point $(x_i, y_i)$ even if the model is
not valid there. Points with “large” $h_{ii}$ are called high leverage points and should
be singled out for further investigation. Note from (6.12) that if $y_i$ is a high leverage
point, then the error $\varepsilon_i$ could be large but the residual $\hat{\varepsilon}_i$ “small”, with the leverage of $y_i$ masking the effect of
a poor model fit at $y_i$. This masking effect is important to be aware of when using plots
and/or other diagnostics to evaluate the model.

A further issue concerns what values of $h_{ii}$ should be considered large. A currently
popular heuristic for doing this results from the following argument. We first observe
that

$$\sum_{i=1}^{n} h_{ii} = \operatorname{tr}(H) = \operatorname{rank}(H) = m + 1 \tag{6.13}$$

as shown in the proof of Theorem 5.4. Since $h_{ii} \ge 0$, $(m + 1)/n$ is the average size
of $h_{ii}$, and Belsley, Kuh and Welsch (BKW) [8] recommend defining $h_{ii}$ as large if $h_{ii} >
2(m + 1)/n$. In general, modern regression software will flag points with |RSTUDENT|
$> 2$ and/or $h_{ii} > 2(m + 1)/n$ as points to be singled out for further consideration. We
will return to this matter shortly.
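As a rough illustration of (6.13) and the BKW rule of thumb, the following sketch computes the leverages from a thin QR factorization of $X$ (an implementation choice made here, not something prescribed by the text) and flags points with $h_{ii} > 2(m+1)/n$. The data are simulated, and one design point is deliberately placed far from the rest.

```python
import numpy as np

def leverages(X):
    """Diagonal of H = X (X^T X)^{-1} X^T via a thin QR factorization."""
    Q, _ = np.linalg.qr(X)            # X = QR with orthonormal columns, so H = Q Q^T
    return np.sum(Q ** 2, axis=1)     # h_ii = squared norm of the i-th row of Q

# Simulated design (hypothetical), with one remote point in row 0
rng = np.random.default_rng(1)
n, m = 30, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, m))])
X[0, 1:] = [8.0, -8.0]

h = leverages(X)
cutoff = 2 * (m + 1) / n              # BKW rule of thumb
flagged = np.flatnonzero(h > cutoff)  # indices of high leverage points
print("sum of leverages:", h.sum(), "cutoff:", cutoff, "flagged:", flagged)
```

The sum of the leverages should equal $m + 1 = 3$ up to rounding, and the remote point should appear among the flagged indices.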
6.3 Residual Plots


6.3.1 Normal Plots 


In general, as in Chapter 3, various residual plots can be used to check the validity
of model assumptions. To check for normality one can use histograms or normal plots
as for simple linear regression (SLR). Since these plots do not depend on the number of
independent variables, they are obtained as for SLR.

Although there are several analytic tests for normality [104, 27], they are quite complicated
and seem not to be widely used.


6.3.2 Variable Plots 


To check for model violations in the functional form of $E(Y \mid x)$ and/or non-constant
variance one typically uses the following plots:

(i) Plots of $\hat{\varepsilon}_i$, $r_i$ or $t_i$ against $x_j$, $1 \le j \le m$, the $j$-th column of $X$.

(ii) Plots of $\hat{\varepsilon}_i$, $r_i$ or $t_i$ versus $\hat{y}$.

(iii) Partial plots.

(iv) Other plots as indicated.


To justify (i) and (ii) we note the following. From the least squares equations we have
$X^T(y - X\hat{\beta}) = 0$. Since $X\hat{\beta} = \hat{y}$, $X^T(y - \hat{y}) = X^T\hat{\varepsilon} = 0$. From (3.162), if we regress $\hat{\varepsilon}$
on $x_j$ the slope of the regression line is

$$\hat{\beta}_j^{*} = (x_j, \hat{\varepsilon}) / \|x_j\|^2. \tag{6.14}$$

Since the $j$-th row of $X^T$ is the $j$-th column of $X$, we have $(x_j, \hat{\varepsilon}) = 0$, and it follows
from (6.14) that $\hat{\beta}_j^{*} = 0$, $1 \le j \le m$. From this argument we see that if (5.1) is the true model then
plotting $\hat{\varepsilon}_i$, $r_i$ or $t_i$ against $x_{ij}$ should give a random scatter of points about the “$x$”-axis.
Using $r_i$ or $t_i$, these points should lie roughly in a band between $\pm 2$ on the residual
axis. Deviations from this suggest nonlinearity of the model in $x_j$ and/or nonconstant
variance.
For (ii) we showed in Theorem 5.6 that $\sum_{i=1}^{n} \hat{y}_i\hat{\varepsilon}_i = 0$, so that the slope of the least
squares line (through the origin) of $\hat{\varepsilon}_i$ against $\hat{y}_i$ is

$$\hat{\alpha} = (\hat{\varepsilon}, \hat{y}) / \|\hat{y}\|^2 = 0. \tag{6.15}$$

So again, if the model is true, plots of $\hat{\varepsilon}$ against $\hat{y}$ should be randomly scattered about
the abscissa as indicated in Figure 3.17. Deviations from this pattern again suggest
nonconstant variance if there is a funnel shape, or nonlinearity if there are “trends” in
the plot. Some authors have suggested that variance inequality is more easily seen by
plotting $|\hat{\varepsilon}_i|$ against $\hat{y}_i$, $1 \le i \le n$.
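The two zero-slope results (6.14) and (6.15) are easy to confirm numerically. The sketch below uses simulated data (the design, coefficients and seed are arbitrary assumptions) and checks that the through-the-origin slopes of the residuals on each column of $X$ and on the fitted values vanish up to rounding error.

```python
import numpy as np

# Simulated data (hypothetical, for illustration only)
rng = np.random.default_rng(2)
n = 25
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, 2))])
y = X @ np.array([3.0, 1.5, -0.5]) + rng.normal(size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta_hat
resid = y - y_hat

# Slope of the residuals regressed on each column x_j, as in (6.14)
slopes_on_x = (X.T @ resid) / np.sum(X ** 2, axis=0)

# Slope of the residuals regressed on the fitted values, as in (6.15)
slope_on_fit = (resid @ y_hat) / (y_hat @ y_hat)

print(slopes_on_x, slope_on_fit)   # all zero up to floating point error
```

Any visible trend in a plot of residuals against $x_j$ or $\hat{y}$ is therefore not a linear one; it must come from curvature or changing spread.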
6.3.3 Partial Plots


Although plots of $\hat{\varepsilon}_i$ versus $x_j$ and $\hat{\varepsilon}$ against $\hat{y}$ can indicate defects in the model, they
may not be informative as to what these defects are. In SLR, plotting $\hat{\varepsilon}_i$, $1 \le i \le n$,
against $x_i$, $1 \le i \le n$, can be used to detect deviations in functional form from linearity.
However, in multiple linear regression (MLR) such plots, as with scatter plots, can be
confused by the fact that $\hat{\varepsilon}$ depends on all of the predictors, and so does not necessarily isolate
the effect of a given variable with the effects of the other variables removed. To remedy this, a
number of plots have been proposed which better isolate the behavior of the $j$-th variable.
These plots may be thought of as substitutes for scatter plots in SLR. We discuss two
such plots [87]:


(i) Partial residual plots; 


(ii) Partial regression plots (also called added variable plots). 


Partial Residual Plots 


For partial residual plots we define

$$\hat{\varepsilon}_j^{*} = \hat{\varepsilon} + \hat{\beta}_j x_j \tag{6.16}$$

where $\hat{\varepsilon}$ is the residual vector from the least squares fit of (5.1) and $\hat{\beta}_j$ is the least squares
estimate of $\beta_j$. For the partial residual plot we plot $\hat{\varepsilon}_j^{*}$ against $x_j$, $1 \le j \le m$. If the
model is correct, then this plot should display a random scatter about the line with slope
$\hat{\beta}_j$. Deviations can be used to detect violations in the assumption of linearity in $x_j$.

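To see this, notice first that the relation between $\hat{\varepsilon}_j^{*}$ and $x_j$ is of the form of the SLR
model for regression through the origin. If the model is true, then $\hat{\varepsilon}_i$, $1 \le i \le n$, should
scatter about the $x$-axis and so we can consider regressing $\hat{\varepsilon}_j^{*}$ against $x_j$. From (3.18)
the slope of the least squares line is given by

$$\hat{\gamma}_j = \frac{(\hat{\varepsilon}_j^{*}, x_j)}{(x_j, x_j)} = \frac{(\hat{\varepsilon} + \hat{\beta}_j x_j, x_j)}{(x_j, x_j)}. \tag{6.17}$$

Now, $(\hat{\varepsilon} + \hat{\beta}_j x_j, x_j) = (\hat{\varepsilon}, x_j) + \hat{\beta}_j (x_j, x_j)$. As before, $(\hat{\varepsilon}, x_j) = 0$, so that the numerator
in (6.17) is $\hat{\beta}_j (x_j, x_j)$. Thus,

$$\hat{\gamma}_j = \hat{\beta}_j. \tag{6.18}$$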
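The identity (6.18) can be checked directly. In the sketch below (simulated data with two correlated predictors; all names and values are illustrative assumptions) the partial residuals are formed as in (6.16) and the through-the-origin slope (6.17) is compared with the least squares estimate.

```python
import numpy as np

# Simulated data with correlated predictors (hypothetical)
rng = np.random.default_rng(3)
n = 40
x1 = rng.uniform(0, 5, n)
x2 = 0.6 * x1 + rng.normal(size=n)
X = np.column_stack([np.ones(n), x1, x2])
y = 2.0 + 1.0 * x1 - 3.0 * x2 + rng.normal(scale=0.5, size=n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

slopes = []
for j in (1, 2):
    xj = X[:, j]
    partial = resid + beta_hat[j] * xj          # partial residual, (6.16)
    slopes.append((partial @ xj) / (xj @ xj))   # through-origin slope, (6.17)

print(slopes, beta_hat[1:])   # the two sets of slopes agree, as in (6.18)
```

Plotting `partial` against `xj` for each $j$ gives the partial residual plot itself; only the slope comparison is shown here.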
To illustrate some of the ideas concerning the various residual plots we begin with an
examination of the housing price model discussed in Example 5.12. Recall there that the
square footage and age were significant in explaining house prices with $R^2 = 0.959$.
Further, we found that $t_1 = 16.63$ and $t_2 = -4.74$, both of which are significant at the < 0.1%
level. However, we need to consider whether the assumptions under which these results
were derived are valid. To do this we consider a number of residual plots as suggested
above. As in Chapter 3, we display normal residual plots, plots of various histograms
and various types of residual plots against fitted values and variables. We begin by
considering the difference in using $\hat{\varepsilon}_i$, $r_i$ and $t_i$.
Figure 6.2: Plots of standardized residuals $r_i$ in Housing price data

Figure 6.3: Plots of studentized residuals $t_i$ in Housing price data

In Figure 6.1 we show these plots using the ordinary residuals $\hat{\varepsilon}_i$, in Figure 6.2 there
are plots using the standardized residuals $r_i$ and in Figure 6.3 the plots using the studentized
residuals $t_i$. A table of the various residuals, fitted values and their leverages
is given in Table 6.1.


Table 6.1 Residuals and Leverage for Housing Data

Obs. No.   Fitted ($\hat{y}_i$)   Resi ($\hat{\varepsilon}_i$)   S Resi ($r_i$)   T Resi ($t_i$)   $h_{ii}$
   1         120,572         -572.1       -0.07917      -0.0758     0.13335
   2         101,123         8876.8        1.21117       1.2377     0.10868
   3         159,137        -9137.4       -1.31776      -1.3642     0.20219
   4          93,338        -3337.5       -0.46350      -0.4478     0.13963
   5         179,859        -4858.5       -0.74759      -0.7330     0.29917
   6          80,796       -15795.7       -2.29871      -2.9420     0.21654
   7          82,068        -4067.9       -0.58926      -0.5725     0.20924
   8          88,975        11025.1        1.64673       1.7921     0.25621
   9         127,116         4383.9        0.61565       0.5990     0.15862
  10         131,872         4221.1        0.58567       0.5689     0.13536
  11         179,859         4141.5        0.63726       0.6207     0.29917
  12         169,043        -4543.4       -0.66239      -0.6461     0.21935
  13         153,109         6891.1        1.20659       1.2324     0.45877
  14         133,569        -1068.6       -0.14352      -0.1375     0.08021
  15         113,665         3834.9        0.51602       0.4996     0.08353
Notice that the normal plots, I Charts and residual plots against the fitted values $\hat{y}_i$
are all similar in appearance. Overall, there is nothing remarkably out of line from our
basic normality assumptions except the residual corresponding to observation 6, which
clearly appears as an outlier with a substantial underpricing for its size and age. However,
its leverage $h_{66} = 0.212$ is less than $(2 \times 3)/15 = 0.4$, the cutoff value suggested
by BKW, so it does not appear to be overly influencing the overall fit. Perhaps there is
something special about that house such as its condition relative to the others. If such
information is available it should be considered before omitting this observation from the
model.

Although the residual plots look similar, there is some difference in the histograms.
The histogram for $\hat{\varepsilon}_i$ is clearly non-normal in appearance, possibly reflecting the unequal
variances and dependence of $\hat{\varepsilon}_i$ on $X$, while those for $r_i$ and $t_i$ are more symmetric in
appearance, with the plot for $t_i$ being most normal in appearance. Overall it appears
that these plots reveal no substantial deviations from the basic normality assumptions.

Further confirmation of these observations is given in Figures 6.4-6.5, which show $t_i$
plotted against square footage and age respectively. Again, the only feature which stands
out is observation 6.


Figure 6.4: Plot of $t_i$ versus square footage (ft$^2$)


Finally, to see if the outlying observation might be accounted for by some nonlinear
behavior in the predictors we present partial residual plots for the partial residuals $\hat{\varepsilon}_1^{*}$,
$\hat{\varepsilon}_2^{*}$ against $x_1$ and $x_2$ respectively in Figures 6.6-6.7. They are striking. Both plots are
almost perfect straight lines and should be contrasted with the scatter plots given in Figures
5.2-5.3. In particular, the partial plot for age shows a clear decrease, for increasing age,
in contrast to the scatter plot Figure 5.3. To verify (6.16) we regressed $\hat{\varepsilon}_1^{*}$ on $x_1$ and $\hat{\varepsilon}_2^{*}$
on $x_2$ and the estimated slopes were

$$\hat{\gamma}_1 = 60.5890 \quad \text{and} \quad \hat{\gamma}_2 = -1726.0$$

which are essentially identical to the least squares estimates

$$\hat{\beta}_1 = 60.589 \quad \text{and} \quad \hat{\beta}_2 = -1726.8.$$


In order to keep the number of plots within reason, we generally just display plots
of $t_i$ against $\hat{y}_i$ and the corresponding normal plots and histograms for model checking
purposes, and residual plots against variables when warranted. The reader is encouraged
to obtain other plots as necessary.

Figure 6.5: Plot of $t_i$ versus age in years


Figure 6.6: Partial residual plot of $\hat{\varepsilon}_1^{*}$ against $x_1$ (square feet)


Example 6.1 (Drink delivery data) In Example 5.14 we indicated that the time to
delivery was quite well explained by the linear model (5.99). To further check the validity
of the proposed model we present a table of residuals and leverage values in Table 6.2
and residual plots of $t_i$ against $\hat{y}_i$ in Figure 6.8. Again, these plots are reasonably in
accord with our normality assumptions except for observation 9 and observation 22. As
we can see from Table 6.2, $h_{9,9} = 0.4983$ and $h_{22,22} = 0.3916$, so both exceed the cutoff value
$(2 \times 3)/25 = 0.24$ and hence appear to be high leverage points which may be affecting the
overall fit of the model. In addition, the histogram of $t_i$ looks skewed and somewhat
non-normal. Similar comments hold for the normal plot. Consequently, we need to be
concerned that the linear model (5.99) is not telling the whole story. One needs to account
for observations 9 and 22 and the possibility that another error model is more appropriate
than the normal. In fact, in the OzDASL$^1$ it was suggested that a gamma distribution was
a more appropriate error distribution than the normal. Such distributions are members
of the exponential family and can be fit using a modification of least squares [87]. (We
shall briefly discuss this topic in Section 7.7 but the details are beyond the scope of this
text.)

$^1$ Australian Data and Story Library; the web site is http://www.statsci.org

Figure 6.7: Partial residual plot of $\hat{\varepsilon}_2^{*}$ against $x_2$


Table 6.2 Residuals and Leverage for Drink Delivery Data

Obs. No.   Fitted ($\hat{y}_i$)   Resi ($\hat{\varepsilon}_i$)   S Resi ($r_i$)   T Resi ($t_i$)   $h_{ii}$
   1         21.708      -5.02808      -1.62768      -1.69563     0.10180
   2         10.354       1.14639       0.36484       0.35754     0.07070
   3         12.080      -0.04979      -0.01609      -0.01572     0.09874
   4          9.956       4.92435       1.57972       1.68916     0.08538
   5         14.194      -0.44440      -0.14176      -0.13856     0.07501
   6         18.400      -0.28957      -0.09081      -0.08874     0.04287
   7          7.155       0.84462       0.27042       0.26465     0.08180
   8         16.673       1.15660       0.36672       0.35939     0.06373
   9         71.82        7.41971       3.21376       4.31078     0.49829
  10         19.124       2.37641       0.81325       0.80678     0.19630
  11         38.093       2.23749       0.71808       0.70994     0.08613
  12         21.593      -0.59804      -0.19326      -0.18897     0.11366
  13         12.473       1.02701       0.32518       0.31847     0.06113
  14         18.682       1.06754       0.34114       0.33418     0.07824
  15         23.329       0.67120       0.21029       0.20566     0.04111
  16         29.663      -0.66293      -0.22270      -0.21783     0.16594
  17         14.914       0.43636       0.13804       0.13492     0.05943
  18         15.551       3.44862       1.11295       1.11933     0.09626
  19          7.707       1.79319       0.57877       0.56981     0.09645
  20         40.888      -5.78797      -1.87355      -1.99668     0.10169
  21         20.514      -2.61418      -0.87784      -0.87309     0.16528
  22         56.007      -3.68653      -1.45000      -1.48962     0.39158
  23         23.358      -4.60757      -1.44369      -1.48247     0.04126
  24         24.403      -4.57285      -1.49606      -1.54222     0.12061
  25         10.963      -0.21258      -0.06751      -0.06596     0.06664
Figure 6.8: Plots of RSTUDENT ($t_i$) versus $\hat{y}_i$ for delivery data


Example 6.2 (Birth weight data) In Example 5.15 we concluded that the birth
weight data could reasonably be described by the model (5.102) which represents two
parallel lines. However, because there is a substantial amount of unexplained variation
in the data ($R^2 = 0.66$) one might be concerned that other variables might be useful for
explaining the data. To examine this possibility we first examine the residual plots for
$t_i$. Figure 6.9 shows a rather random distribution of residuals and all lie between the $\pm 2$
cutoff values. The histogram has a very normal appearance, although some skewness is
indicated in the normal plot. Nothing in these plots suggests violation of the normality
assumptions. For further verification we constructed the partial residual plots and they
are shown in Figures 6.10-6.11. As in Example 6.1 these plots isolate the effect much
more dramatically than the scatter plots. In Figure 6.10 the data generally fall on a line
but there is considerable variability at some points. Since $x_2$ is a dummy 0-1 variable,
the partial residual plot in Figure 6.11 consists of two vertical lines, with male values clearly higher
on average than the females. Notice, however, that the variability of the male residuals
appears larger than that for females.

Finally, we fitted the data in Figures 6.10-6.11 by least squares. For $\hat{\varepsilon}_1^{*}$ the slope was
119 while for $\hat{\varepsilon}_2^{*}$ the slope was 182.2. Both of these are close to the least squares
estimates $\hat{\beta}_1 = 118.67$ and $\hat{\beta}_2 = 166.28$. These observations indicate that the normality
assumptions appear to be valid and the model (5.103) an adequate representation of the
data.




Table 6.3 Residuals and Leverage for Birth Weight Data

Obs. No.   Fitted ($\hat{y}_i$)   Resi ($\hat{\varepsilon}_i$)   S Resi ($r_i$)   T Resi ($t_i$)   $h_{ii}$
   1         3225.95      -257.949      -1.5161      -1.5679     0.1204
   2         2988.61      -193.610      -1.1156      -1.1225     0.0848
   3         3225.95       -62.949      -0.3700      -0.3623     0.1204
   4         2632.60       342.399       2.1533       2.3805     0.2316
   5         2751.27      -126.271      -0.7577      -0.7497     0.1560
   6         2869.94       -22.940      -0.1338      -0.1307     0.1071
   7         3344.62       -52.619      -0.3200      -0.3130     0.1783
   8         3225.95       247.051       1.4521       1.4941     0.1204
   9         2869.94      -241.940      -1.4114      -1.4477     0.1071
  10         2988.61       187.390       1.0798       1.0843     0.0848
  11         3225.95       195.051       1.1464       1.1556     0.1204
  12         2988.61       -13.610      -0.0784      -0.0766     0.0848
  13         3059.67       257.330       1.4987       1.5477     0.1042
  14         2584.99       144.008       0.8789       0.8740     0.1843
  15         3059.67      -124.670      -0.7261      -0.7177     0.1042
  16         2822.33       -68.331      -0.3950      -0.3870     0.0908
  17         3297.01       -87.010      -0.5446      -0.5353     0.2243
  18         2941.00      -124.001      -0.7143      -0.7057     0.0842
  19         3059.67        66.330       0.3863       0.3784     0.1042
  20         2703.66      -164.661      -0.9699      -0.9685     0.1242
  21         2584.99      -172.992      -1.0558      -1.0589     0.1843
  22         2822.33       168.669       0.9751       0.9739     0.0908
  23         2941.00       -66.001      -0.3802      -0.3723     0.0842
  24         3059.67       171.330       0.9979       0.9978     0.1042


Example 6.3 (Longley data) As we have seen in Example 5.17, total employment
1947-1962 could be explained by the six predictors $x_1, \ldots, x_6$ discussed there, but the apparent
multicollinearity in the data made it difficult to identify the important variables. Using
a stepwise regression analysis it was suggested that an adequate model could be obtained
using $x_2$ and $x_3$. To further consider the validity of these models we examine a variety
of residual plots.

In Table 6.4 we display the residuals and leverages from the full model. There appear
to be no outliers, although observation 10 (1956) may be somewhat suspicious, and
no apparent high leverage points. The corresponding residuals $t_i$ are given in Figure
6.12. The normal plot and histogram appear normal but there appears to be some
autocorrelation in the residuals.
Figure 6.9: Plots of RSTUDENT ($t_i$) versus $\hat{y}_i$ for birth weight data

Figure 6.10: Partial residual plot for $x_1$ (gestation)

Figure 6.11: Partial residual plot for $x_2$ (sex)

Table 6.4 Residuals and Leverage for Longley data

Obs. No.   Fitted ($\hat{y}_i$)   Resi ($\hat{\varepsilon}_i$)   S Resi ($r_i$)   T Resi ($t_i$)   $h_{ii}$
   1         60.0557       0.2673       1.1639       1.1906     0.4336
   2         61.2200      -0.0980      -0.4834      -0.4618     0.5587
   3         60.1270       0.0440       0.1805       0.1705     0.3610
   4         61.5934      -0.4064      -1.6909      -1.9300     0.3798
   5         62.9060       0.3150       1.6639       1.8853     0.6152
   6         63.8874      -0.2484      -1.0252      -1.0285     0.3694
   7         65.1597      -0.1707      -0.7809      -0.7625     0.4871
   8         63.7767      -0.0157      -0.0731      -0.0689     0.5046
   9         65.9988       0.0202       0.0902       0.0851     0.4596
  10         67.4005       0.4565       1.8287       2.1750     0.3308
  11         68.1887      -0.0197      -0.0808      -0.0762     0.3607
  12         66.5520      -0.0390      -0.1778      -0.1679     0.4840
  13         68.8042      -0.1492      -0.6154      -0.5928     0.3688
  14         69.6493      -0.0853      -0.3181      -0.3016     0.2283
  15         68.9915       0.3395       1.4050       1.4992     0.3730
  16         70.7612      -0.2102      -1.2282      -1.2692     0.6854


In Figure 6.13 are shown the residual plots for $t_i$ for the “best” model selected by the
stepwise procedure. Again, the normal plot and histogram look reasonably normal but
now observation 10 appears as an outlier and the I Chart and residual plot display much
more pronounced autocorrelation. The overall impression is that the effect of time has
not been properly accounted for in the reduced model. We will return to this matter later.

Figure 6.12: Plots of residuals for full Longley data

Figure 6.13: Plots of residuals $t_i$ for $x_2, x_3$ (best model)


Partial Regression (Leverage) Plots


Partial residual plots have been criticized by some statisticians for overestimating the
effect of $x_j$ on the fit and other plots have been suggested as alternatives. An important
plot, called a partial regression or added variable plot, is widely advocated as an alternative
(or complement) to partial residual plots. To motivate these plots, consider the problem
of deciding whether to add a new explanatory variable $w$ to the model (5.1), where we wish
to estimate its effect. This will be determined by fitting the augmented model

$$Y = X\beta + \gamma w + \varepsilon \tag{6.19}$$


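obtained by appending $w$ to the design matrix. Hence we can write (6.19) in partitioned
form as

$$Y = [X \mid w]\begin{pmatrix}\beta \\ \gamma\end{pmatrix} + \varepsilon = X_w\beta_w + \varepsilon. \tag{6.20}$$

The model (6.19) is fit by least squares giving the estimate $\hat{\beta}_w = (\hat{\alpha} \mid \hat{\gamma})^T$, where $\hat{\alpha}$ is the
revised estimate for $\beta$ in (6.20). (Generally, $\hat{\alpha} \ne \hat{\beta}$, unless $w$ is orthogonal to the columns
of $X$.) We now derive a formula for $\hat{\gamma}$. From (6.20)

$$\hat{\beta}_w = (X_w^TX_w)^{-1}X_w^Ty \tag{6.21}$$

or equivalently $\hat{\beta}_w$ satisfies

$$(X_w^TX_w)\hat{\beta}_w = X_w^Ty. \tag{6.22}$$

Now from the partitioned form of $X_w$,

$$X_w^TX_w = \begin{pmatrix} X^TX & X^Tw \\ w^TX & w^Tw \end{pmatrix}. \tag{6.23}$$

Thus, $\hat{\alpha}$ and $\hat{\gamma}$ satisfy

$$X^TX\hat{\alpha} + (X^Tw)\hat{\gamma} = X^Ty, \tag{6.24}$$

$$w^TX\hat{\alpha} + (w^Tw)\hat{\gamma} = w^Ty. \tag{6.25}$$

From (6.24)

$$\hat{\alpha} = (X^TX)^{-1}X^Ty - (X^TX)^{-1}X^Tw\,\hat{\gamma} \tag{6.26}$$

and substituting (6.26) into (6.25) gives

$$w^TX(X^TX)^{-1}X^Ty - w^TX(X^TX)^{-1}X^Tw\,\hat{\gamma} + (w^Tw)\hat{\gamma} = w^Ty. \tag{6.27}$$

Thus, $\hat{\gamma}$ satisfies

$$\left[w^Tw - w^TX(X^TX)^{-1}X^Tw\right]\hat{\gamma} = w^Ty - w^TX(X^TX)^{-1}X^Ty = w^T\left[I - X(X^TX)^{-1}X^T\right]y \tag{6.28}$$

so that, using the fact that $I - H$ is symmetric and idempotent,

$$\hat{\gamma} = \frac{w^T(I - H)y}{w^T(I - H)w} = \frac{(w, (I - H)y)}{(w, (I - H)w)} = \frac{((I - H)w, (I - H)y)}{((I - H)w, (I - H)w)}. \tag{6.29}$$

But $(I - H)y = y - \hat{y} = \hat{\varepsilon}$, so that

$$\hat{\gamma} = \frac{(\hat{\varepsilon}, (I - H)w)}{((I - H)w, (I - H)w)}. \tag{6.30}$$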
From (6.29) and (6.30) we see that $\hat{\gamma}$ is the slope of the least squares line obtained by
regressing $\hat{\varepsilon}$ on $w_{res} = (I - H)w$, the residuals obtained by regressing $w$ on the columns
of $X$.

Now reversing the argument, suppose that $w$ is a column of the original model, say
$x_j$. Then, using (6.30) with $w = x_j$ and $X = X_{(-j)}$, where $X_{(-j)}$ is the design matrix
with $x_j$ deleted, the least squares estimate $\hat{\beta}_j$ of $\beta_j$ is given by

$$\hat{\beta}_j = \frac{(\hat{\varepsilon}_{(-j)}, x_{j,res})}{(x_{j,res}, x_{j,res})} \tag{6.31}$$

and again this is the slope of the regression line obtained by plotting the residuals $\hat{\varepsilon}_{(-j)}$ from
the model without $x_j$ against $x_{j,res}$, which is $x_j$ with the effect of $x_i$, $i \ne j$, removed. This
argument suggests plotting the residuals $\hat{\varepsilon}_{(-j)}$ against $x_{j,res}$, $1 \le j \le m$. From (6.31), it
follows that if the model is true, then this plot should display a scatter of points about a
line through the origin with slope $\hat{\beta}_j$. As before, deviations from linearity in $x_j$ show up
in a plot differing markedly from linearity. In essence, these plots play the role of scatter
plots for SLR.

In addition, it can be shown that high leverage points in the response can be found in
the extreme values of $x_{j,res}$ [87]. Another interesting consequence of the formula (6.30)
for $\hat{\gamma}$ is that it shows that multiple regression can be considered as a sequence of simple
linear regressions, obtained by successively regressing an additional variable on the residuals
from fitting the previous variables.
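The content of (6.30) can be illustrated with a short numerical sketch: the coefficient of $w$ in the augmented fit equals the slope obtained by regressing the residuals $\hat{\varepsilon}$ on $w_{res}$. The data below are simulated and the helper `ls_resid` is a name introduced only for this illustration.

```python
import numpy as np

def ls_resid(A, b):
    """Residuals from the least squares regression of b on the columns of A."""
    coef, *_ = np.linalg.lstsq(A, b, rcond=None)
    return b - A @ coef

# Simulated data (hypothetical): w is correlated with a column of X
rng = np.random.default_rng(4)
n = 50
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
w = 0.5 * X[:, 1] + rng.normal(size=n)
y = X @ np.array([1.0, 2.0, -1.0]) + 0.7 * w + rng.normal(size=n)

# gamma-hat from the full augmented fit, as in (6.21)
coef_full, *_ = np.linalg.lstsq(np.column_stack([X, w]), y, rcond=None)
gamma_full = coef_full[-1]

# gamma-hat from the added variable construction, as in (6.30)
e = ls_resid(X, y)          # residuals of y on X
w_res = ls_resid(X, w)      # (I - H) w
gamma_avp = (e @ w_res) / (w_res @ w_res)

print(gamma_full, gamma_avp)   # equal up to rounding
```

A scatter plot of `e` against `w_res` is the added variable plot for $w$; its least squares line passes through the origin with slope $\hat{\gamma}$.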


6.4 PRESS Residuals 


So far in our discussion of residuals, we have dealt with residuals which arise from model
fitting. However, if we are interested in using the model for prediction in addition to
explanation, it would be useful to have a way of quantifying how well the model predicts
in addition to how well it fits the observed data. One could use confidence intervals
and prediction intervals, but unless we know exactly which points we wish to use, this
approach does not seem to provide a convenient method for practical application. Perhaps
the simplest approach to measuring the predictive power of the model would be to
examine the residuals obtained by predicting values at new points. This is a “Catch-22”
since we generally have no data at these points.

A convenient proxy for measuring the error at a new point is to use the data itself for
this purpose. A reasonable approach, as indicated in Chapter 3, would be to omit one
observation from the data, fit the model without this case, and compare what the model
predicts there with the observation. If we omit the $i$-th observation, let $\hat{y}_{(-i)}$ be
the predicted value from the model with the $i$-th observation removed. Then the value

$$\hat{\varepsilon}_{(-i)} = y_i - \hat{y}_{(-i)}, \quad 1 \le i \le n \tag{6.32}$$

is called the $i$-th PRESS residual. (PRESS is short for prediction error sum of squares.)
The sum of squares

$$PRESS = \sum_{i=1}^{n} \hat{\varepsilon}_{(-i)}^2 \tag{6.33}$$

is a useful measure of predictive accuracy. Before discussing the use of PRESS as a
diagnostic for model validity, it is interesting to consider the computation of $\hat{\varepsilon}_{(-i)}$, $1 \le
i \le n$. If $n$ is large, this looks like a daunting amount of computation, even with modern
computers, since it appears that one needs to fit a new model to compute each $\hat{\varepsilon}_{(-i)}$.
Remarkably, this turns out to be unnecessary. As we will see, it can be shown that

$$\hat{\varepsilon}_{(-i)} = \frac{\hat{\varepsilon}_i}{1 - h_{ii}} \tag{6.34}$$

where $\hat{\varepsilon}_i$ is the residual from fitting the model to all of the data. Hence, all of these
residuals can be obtained from factors obtained entirely from the original fit. Because
these deletion statistics play an important role in modern regression analysis, we make
a brief digression to establish (6.34) and some related results.
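Before the proof, (6.34) can be checked numerically. The sketch below, on simulated data, computes the PRESS residuals twice: once by refitting the model $n$ times as in (6.32), and once from the single full fit via (6.34). The function names are illustrative assumptions, not standard library routines.

```python
import numpy as np

def press_residuals_fast(X, y):
    """PRESS residuals from the full fit only, using (6.34)."""
    H = X @ np.linalg.solve(X.T @ X, X.T)
    return (y - H @ y) / (1.0 - np.diag(H))

def press_residuals_slow(X, y):
    """PRESS residuals by leaving out each observation in turn, as in (6.32)."""
    n = len(y)
    out = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i
        b, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
        out[i] = y[i] - X[i] @ b      # y_i minus the prediction without case i
    return out

# Simulated data (hypothetical, for illustration only)
rng = np.random.default_rng(5)
n = 30
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = X @ np.array([1.0, 0.5, -2.0, 0.0]) + rng.normal(size=n)

fast = press_residuals_fast(X, y)
slow = press_residuals_slow(X, y)
press = np.sum(fast ** 2)             # PRESS statistic, (6.33)
print(np.max(np.abs(fast - slow)), press)
```

The two versions agree to rounding error, while the fast version requires only one fit.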


6.4.1 Deletion Statistics 


We will refer to any of the regression statistics obtained by omitting one or more observations
as deletion statistics. As we show, the trick to obtaining efficient formulas for
these is to apply the Sherman-Woodbury-Morrison theorem in Section 4.4 to the design
matrix $X$ with one row removed.

Let $X_{(-i)}$ denote the design matrix resulting from deleting the $i$-th row; then the
basic result we require is a formula relating $[X_{(-i)}^TX_{(-i)}]^{-1}$ and $(X^TX)^{-1}$.

Theorem 6.1 Let $x_i$ denote the $i$-th row of $X$. Then, if $h_{ii} \ne 1$,

$$\left[X_{(-i)}^TX_{(-i)}\right]^{-1} = (X^TX)^{-1} + \frac{(X^TX)^{-1}x_i^Tx_i(X^TX)^{-1}}{1 - h_{ii}} \tag{6.35}$$

where $h_{ii}$ is the $i$-th diagonal element of $H$.


Before proving Theorem 6.1 we make the following crucial observation which relates
$X_{(-i)}^TX_{(-i)}$ to $X^TX$, i.e.,

$$X^TX = X_{(-i)}^TX_{(-i)} + x_i^Tx_i, \tag{6.36}$$

where each of the three matrices is $(m+1) \times (m+1)$.

For notational convenience, we assume that $i = n$, which can always be achieved by
permuting the rows of $X$. Then the $ij$-th element of $X^TX$ is given by

$$(X^TX)_{ij} = \sum_{k=1}^{n} x_{ki}x_{kj} = \sum_{k=1}^{n-1} x_{ki}x_{kj} + x_{ni}x_{nj}. \tag{6.37}$$

Now the $ij$-th element of $x_n^Tx_n$ is $x_{ni}x_{nj}$ and $\sum_{k=1}^{n-1} x_{ki}x_{kj}$ is the $ij$-th element of
$X_{(-n)}^TX_{(-n)}$. Hence, (6.36) follows.


Proof of Theorem 6.1. Letting $A = X^TX$, $B = X_{(-i)}^TX_{(-i)}$ and $z = x_i^T$ in
Theorem 4.5 gives

$$\left[X_{(-i)}^TX_{(-i)}\right]^{-1} = (X^TX)^{-1} + \frac{(X^TX)^{-1}x_i^Tx_i(X^TX)^{-1}}{1 - x_i(X^TX)^{-1}x_i^T}. \tag{6.38}$$

But $h_{ii} = x_i(X^TX)^{-1}x_i^T$, so that (6.35) follows. ∎
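Both the rank-one relation (6.36) and the inverse formula (6.35) are straightforward to verify numerically. In the sketch below (simulated design, arbitrary choice of the deleted row), the inverse with row $i$ removed is computed directly and via the update; since $(X^TX)^{-1}$ is symmetric, the correction term is written as an outer product.

```python
import numpy as np

# Simulated design matrix (hypothetical)
rng = np.random.default_rng(6)
n = 12
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

i = 4                                     # row to delete (arbitrary)
xi = X[i]                                 # i-th row of X
XtX_inv = np.linalg.inv(X.T @ X)
h_ii = xi @ XtX_inv @ xi                  # leverage of case i

# Direct computation with the i-th row removed
X_del = np.delete(X, i, axis=0)
direct = np.linalg.inv(X_del.T @ X_del)

# Update formula (6.35): (X^T X)^{-1} x_i^T x_i (X^T X)^{-1} = v v^T
v = XtX_inv @ xi
updated = XtX_inv + np.outer(v, v) / (1.0 - h_ii)

print(np.max(np.abs(direct - updated)))   # close to zero
```

Explicit inverses are used here only to mirror the formulas; in practice one would work with a factorization of $X$.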


Theorem 6.2 Let $\hat{\varepsilon}_{(-i)}$ be the $i$-th PRESS residual, then

$$\hat{\varepsilon}_{(-i)} = \frac{\hat{\varepsilon}_i}{1 - h_{ii}}, \quad 1 \le i \le n. \tag{6.39}$$


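Proof. Let $\hat{\beta}_{(-i)}$ be the coefficient vector of parameter estimates omitting the $i$-th
observation. Then,

$$\hat{\beta}_{(-i)} = \left[X_{(-i)}^TX_{(-i)}\right]^{-1}X_{(-i)}^Ty_{(-i)} \tag{6.40}$$

where $y_{(-i)}$ is the observation vector $y$ with $y_i$ omitted. Then,

$$\hat{y}_{(-i)} = (x_i, \hat{\beta}_{(-i)}) = \left(x_i, \left[X_{(-i)}^TX_{(-i)}\right]^{-1}X_{(-i)}^Ty_{(-i)}\right). \tag{6.41}$$

Using (6.38),

$$\begin{aligned}
\hat{y}_{(-i)} &= \left(x_i, \left[(X^TX)^{-1} + \frac{(X^TX)^{-1}x_i^Tx_i(X^TX)^{-1}}{1 - h_{ii}}\right]X_{(-i)}^Ty_{(-i)}\right) \\
&= \left(x_i, (X^TX)^{-1}X_{(-i)}^Ty_{(-i)}\right) + \frac{\left(x_i, (X^TX)^{-1}x_i^Tx_i(X^TX)^{-1}X_{(-i)}^Ty_{(-i)}\right)}{1 - h_{ii}} \\
&= S_1 + S_2.
\end{aligned} \tag{6.42}$$

Now using the fact that $X_{(-i)}^Ty_{(-i)} = X^Ty - y_ix_i^T$,

$$\begin{aligned}
S_1 &= \left(x_i, (X^TX)^{-1}(X^Ty - y_ix_i^T)\right) \\
&= \left(x_i, (X^TX)^{-1}X^Ty\right) - y_i\left(x_i, (X^TX)^{-1}x_i^T\right) \\
&= (x_i, \hat{\beta}) - h_{ii}y_i \\
&= \hat{y}_i - h_{ii}y_i.
\end{aligned} \tag{6.43}$$

Similarly, the numerator of $S_2$ equals

$$\left(x_i, (X^TX)^{-1}x_i^Tx_i(X^TX)^{-1}X^Ty\right) - y_i\left(x_i, (X^TX)^{-1}x_i^Tx_i(X^TX)^{-1}x_i^T\right) = h_{ii}\hat{y}_i - h_{ii}^2y_i. \tag{6.44}$$

Thus,

$$\hat{y}_{(-i)} = \hat{y}_i - h_{ii}y_i + \frac{h_{ii}\hat{y}_i - h_{ii}^2y_i}{1 - h_{ii}}. \tag{6.45}$$

A little elementary algebra then shows that

$$y_i - \hat{y}_{(-i)} = \frac{y_i - \hat{y}_i}{1 - h_{ii}} \tag{6.46}$$

which is (6.39). ∎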


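For future reference we need analogous formulas for $\hat{\beta} - \hat{\beta}_{(-i)}$ and $SSE_{(-i)}$.

Theorem 6.3 (i) As above, let $\hat{\beta}_{(-i)}$ be the least squares estimate of $\beta$ with the $i$-th
observation deleted, then,

$$\hat{\beta} - \hat{\beta}_{(-i)} = \frac{(X^TX)^{-1}x_i^T\hat{\varepsilon}_i}{1 - h_{ii}} = (X^TX)^{-1}x_i^T\hat{\varepsilon}_{(-i)}. \tag{6.47}$$

(ii) If $SSE_{(-i)} = \sum_{j=1, j \ne i}^{n} \left[y_j - x_j\hat{\beta}_{(-i)}\right]^2$ is the error sum of squares for the fitted
model with the $i$-th observation deleted, then,

$$SSE_{(-i)} = \sum_{j=1}^{n} \hat{\varepsilon}_j^2 - \frac{\hat{\varepsilon}_i^2}{1 - h_{ii}}. \tag{6.48}$$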
j=l 11 


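Proof. (i) As in Theorem 6.2,

$$\begin{aligned}
\hat{\beta}_{(-i)} &= \left[X_{(-i)}^TX_{(-i)}\right]^{-1}X_{(-i)}^Ty_{(-i)} \\
&= (X^TX)^{-1}X_{(-i)}^Ty_{(-i)} + \frac{(X^TX)^{-1}x_i^Tx_i(X^TX)^{-1}X_{(-i)}^Ty_{(-i)}}{1 - h_{ii}} \\
&= S_1 + S_2.
\end{aligned} \tag{6.49}$$

As before,

$$S_1 = \hat{\beta} - y_i(X^TX)^{-1}x_i^T \tag{6.50}$$

and the numerator of $S_2$ is

$$(X^TX)^{-1}x_i^Tx_i(X^TX)^{-1}(X^Ty - y_ix_i^T) = (X^TX)^{-1}x_i^T\hat{y}_i - y_i(X^TX)^{-1}x_i^Th_{ii}. \tag{6.51}$$

Thus,

$$\begin{aligned}
\hat{\beta}_{(-i)} &= \hat{\beta} - y_i(X^TX)^{-1}x_i^T + \frac{(X^TX)^{-1}x_i^T\hat{y}_i - h_{ii}y_i(X^TX)^{-1}x_i^T}{1 - h_{ii}} \\
&= \hat{\beta} - (X^TX)^{-1}x_i^T\,\frac{y_i - \hat{y}_i}{1 - h_{ii}}.
\end{aligned} \tag{6.52}$$

Thus, $\hat{\beta}_{(-i)} - \hat{\beta} = -(X^TX)^{-1}x_i^T\hat{\varepsilon}_{(-i)}$, and changing signs gives (6.47).

(ii) By definition,

$$\begin{aligned}
SSE_{(-i)} &= \left(y_{(-i)} - X_{(-i)}\hat{\beta}_{(-i)},\; y_{(-i)} - X_{(-i)}\hat{\beta}_{(-i)}\right) \\
&= \sum_{j=1, j \ne i}^{n} \left[y_j - x_j\hat{\beta}_{(-i)}\right]^2 \\
&= \sum_{j=1}^{n} \left[y_j - x_j\hat{\beta}_{(-i)}\right]^2 - \left[y_i - x_i\hat{\beta}_{(-i)}\right]^2.
\end{aligned} \tag{6.53}$$

From (i),

$$x_j\hat{\beta}_{(-i)} = x_j\hat{\beta} - \frac{x_j(X^TX)^{-1}x_i^T\hat{\varepsilon}_i}{1 - h_{ii}}, \tag{6.54}$$

so that

$$SSE_{(-i)} = \sum_{j=1}^{n} \left[y_j - x_j\hat{\beta} + \frac{x_j(X^TX)^{-1}x_i^T\hat{\varepsilon}_i}{1 - h_{ii}}\right]^2 - \left[y_i - x_i\hat{\beta} + \frac{x_i(X^TX)^{-1}x_i^T\hat{\varepsilon}_i}{1 - h_{ii}}\right]^2. \tag{6.55}$$

Using the fact that $h_{ij} = x_i(X^TX)^{-1}x_j^T$, this simplifies to

$$SSE_{(-i)} = \sum_{j=1}^{n} \left[\hat{\varepsilon}_j + \frac{h_{ij}\hat{\varepsilon}_i}{1 - h_{ii}}\right]^2 - \frac{\hat{\varepsilon}_i^2}{(1 - h_{ii})^2}. \tag{6.56}$$

Expanding the sum in (6.56) gives

$$SSE_{(-i)} = \sum_{j=1}^{n} \hat{\varepsilon}_j^2 + \frac{2\hat{\varepsilon}_i}{1 - h_{ii}}\sum_{j=1}^{n} h_{ij}\hat{\varepsilon}_j + \frac{\hat{\varepsilon}_i^2}{(1 - h_{ii})^2}\sum_{j=1}^{n} h_{ij}^2 - \frac{\hat{\varepsilon}_i^2}{(1 - h_{ii})^2}. \tag{6.57}$$

From (6.11), $\sum_{j=1}^{n} h_{ij}^2 = h_{ii}$, and using the fact that $H\hat{\varepsilon} = Hy - H\hat{y} = 0$ (because $H\hat{y} =
H(Hy) = H^2y = Hy$), $\sum_{j=1}^{n} h_{ij}\hat{\varepsilon}_j = 0$. Using these in (6.57) gives

$$\begin{aligned}
SSE_{(-i)} &= \sum_{j=1}^{n} \hat{\varepsilon}_j^2 + \frac{h_{ii}\hat{\varepsilon}_i^2}{(1 - h_{ii})^2} - \frac{\hat{\varepsilon}_i^2}{(1 - h_{ii})^2} \\
&= \sum_{j=1}^{n} \hat{\varepsilon}_j^2 - \frac{\hat{\varepsilon}_i^2}{1 - h_{ii}}
\end{aligned} \tag{6.58}$$

as required. ∎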
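Both parts of Theorem 6.3 can be checked against a direct refit. The sketch below uses simulated data and an arbitrary choice of the deleted case; the variable names are illustrative only.

```python
import numpy as np

# Simulated data (hypothetical, for illustration only)
rng = np.random.default_rng(7)
n = 15
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y
resid = y - X @ beta
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # h_ii = x_i (X^T X)^{-1} x_i^T
sse = resid @ resid

i = 3                                         # case to delete (arbitrary)
keep = np.arange(n) != i

# Direct refit without case i
beta_del, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
sse_del = np.sum((y[keep] - X[keep] @ beta_del) ** 2)

# Deletion formulas (6.47) and (6.48), using only the full fit
beta_diff_formula = XtX_inv @ X[i] * resid[i] / (1.0 - h[i])
sse_del_formula = sse - resid[i] ** 2 / (1.0 - h[i])

print(beta - beta_del, beta_diff_formula)
print(sse_del, sse_del_formula)
```

In each printed pair the two quantities coincide up to rounding error.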


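Corollary 6.1 For a model (5.1) with $m + 1$ parameters

$$E\left[SSE_{(-i)}\right] = (n - m - 2)\sigma^2, \tag{6.59}$$

so that

$$\hat{\sigma}_{(-i)}^2 = \frac{SSE_{(-i)}}{n - m - 2} \tag{6.60}$$

is an unbiased estimator of $\sigma^2$. Thus

$$\hat{\sigma}_{(-i)}^2 = \frac{(1 - h_{ii})(n - m - 1)s^2 - \hat{\varepsilon}_i^2}{(1 - h_{ii})(n - m - 2)} \tag{6.61}$$

where $s^2 = SSE/(n - m - 1)$ is the residual mean square for the full model.

Proof. We prove (6.59); (6.60) and (6.61) are then direct consequences of (6.59) and
(6.48). Since $E(\hat{\varepsilon}_i^2) = \operatorname{Var}(\hat{\varepsilon}_i) = \sigma^2(1 - h_{ii})$, $1 \le i \le n$, it follows from (6.48) that

$$E\left[SSE_{(-i)}\right] = \sum_{j=1}^{n} (1 - h_{jj})\sigma^2 - \sigma^2 = \sigma^2(n - 1) - \sigma^2\sum_{j=1}^{n} h_{jj}. \tag{6.62}$$

But $\sum_{j=1}^{n} h_{jj} = \operatorname{tr}(H) = m + 1$, as follows from the argument in Theorem 5.4. Thus,
$E\left[SSE_{(-i)}\right] = (n - m - 2)\sigma^2$, as required. ∎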

We make several comments concerning the results in Theorems 6.1-6.3. Since it can
be shown that $SSE_{(-i)}$ and $\hat{\varepsilon}_i$ are independent random variables, that $SSE_{(-i)}/\sigma^2$ is a
$\chi^2(n - m - 2)$ random variable (see Section 2.8), and that $\hat{\varepsilon}_i/(\sigma\sqrt{1 - h_{ii}})$ is $N(0, 1)$, then

$$t_i = \frac{\hat{\varepsilon}_i}{\hat{\sigma}_{(-i)}\sqrt{1 - h_{ii}}} \tag{6.63}$$

has a $t$-distribution with $n - m - 2$ degrees of freedom. Of course $t_i$ is the same as
RSTUDENT introduced in Section 6.2. As noted there, because the distribution of $t_i$ is
known, one can use RSTUDENT to conduct tests of hypotheses concerning a particular
observation being an outlier.

There is also an interesting relation between the PRESS residual and RSTUDENT.
From (6.39), $E(\hat{\varepsilon}_{(-i)}) = 0$ and $\operatorname{Var}(\hat{\varepsilon}_{(-i)}) = \sigma^2/(1 - h_{ii})$. Hence, if we define a
standardized PRESS residual, then

$$\frac{\hat{\varepsilon}_{(-i)}}{\sigma(\hat{\varepsilon}_{(-i)})} = \frac{\hat{\varepsilon}_i/(1 - h_{ii})}{\sigma/\sqrt{1 - h_{ii}}} = \frac{\hat{\varepsilon}_i}{\sigma\sqrt{1 - h_{ii}}}, \tag{6.64}$$

which becomes $r_i$ when $\sigma$ is estimated by $s$.

A similar calculation shows that using $\hat{\sigma}_{(-i)}$ to estimate $\sigma$, the studentized PRESS
residual is given by $\hat{\varepsilon}_i/(\hat{\sigma}_{(-i)}\sqrt{1 - h_{ii}}) =$ RSTUDENT. Hence, plots of studentized PRESS
residuals are the same as plots of RSTUDENT.

Further, we note that formulas (6.63) and (6.64) carry interesting diagnostic information.
From (6.39) we see that if the $i$-th observation point has high leverage, then
$\hat{\varepsilon}_{(-i)}$ will generally be much larger than $\hat{\varepsilon}_i$. Since high leverage points are fitted well, as
measured by $\hat{\varepsilon}_i$, they may predict poorly. Again, this is another manifestation of the
fit/prediction dilemma. How to account for both is a continuing conundrum in statistical
analysis.

This same phenomenon shows up in (6.47) for $\hat{\beta} - \hat{\beta}_{(-i)}$, which shows that the
influence of the $i$-th observation depends again on $\hat{\varepsilon}_{(-i)}$. So, it can be “small” if the fit
is good but large if $h_{ii}$ is large. Here, again, fit and leverage act in opposite directions.
How to reconcile this is considered next.




6.4.2 Influence Diagnostics 


Until recently, the emphasis in multiple regression analysis has been to determine the 
appropriate variables to include in the model (5.1). From a data point of view, the 
emphasis was placed on the relations among the columns of X. This has already been 
well illustrated in Chapter 5 and will be further elaborated on in Chapters 8 and 9. 

Since we only have a finite sample of $(X, y)$ values to work with, even if the model is perfect, it might be possible to obtain different conclusions if a different sample $(X', y')$ were obtained. Unfortunately in most situations we will only have the original data $(X, y)$ to work with, and it then becomes important to consider how this particular sample affects our conclusions. This leads us to focus on the effects that the rows of $X$ have on the estimated model. Of course if we had multiple samples we could compare the effect of different observations on the estimated model. Since generally this cannot be done, we must look for a different approach, similar to the use of PRESS residuals. That is, we consider omitting observations one or more at a time to see how $\hat\beta$ and $\hat y$ change. Clearly, even if the number of observations is moderate, this can lead to the examination of a large amount of data (if $n = 10$, there could be $2^{10} - 1 = 1023$ different regressions). However, we can compute the effect of deleting observations without having to calculate any additional regressions, as we have already done for computing the PRESS residuals in the previous section. These were required for the computation of $\hat\beta - \hat\beta_{(-i)}$. As we shall see, these formulas are useful for investigating the effect of individual data points on the estimation of $\beta$ and the fit $\hat y$. Although formulas can be obtained for the deletion of subsets containing more than one observation, we will only consider single point influence statistics here.

Before proceeding with the more formal analysis, we discuss the main reason for doing this. Since $\hat\beta$ and $\hat y$ both depend on $(X, y)$, it is important to know whether all points are contributing equally to the estimation or whether there are points which are unduly affecting it. For example, points with large leverage $h_{ii}$ indicate that the i-th observation overly influences the fit, while points with large residuals suggest possible model inadequacy. It is important to know if these discrepant points should be used, set aside or reexamined further as to their occurrence. As we have already observed, high leverage can mask residuals when using $\hat e_i$ or $t_i$, so examination of these alone may not be sufficient to detect unusual influential points.

As we shall see, examination of the statistics $\hat\beta - \hat\beta_{(-i)}$ and $\hat y - \hat y_{(-i)}$ enables us to combine these two competing factors to mathematically assess the influence of the i-th observation $(x_i, y_i)$.


6.4.3 Influence on $\hat\beta$, $\hat y_i$


Since $\hat\beta - \hat\beta_{(-i)}$ clearly measures the effect of deleting the i-th observation on $\hat\beta$, this should form the basis of influence diagnostics of $x_i$ on $\hat\beta$. First, if the effect on an individual coefficient is desired, one can then consider the j-th component

$$\left(\hat\beta - \hat\beta_{(-i)}\right)_j = \hat\beta_j - \hat\beta_{j(-i)}. \tag{6.65}$$

Using $\hat\beta - \hat\beta_{(-i)} = (X^TX)^{-1}x_i^T\,\hat e_i/(1-h_{ii})$, as follows from (6.47), we get

$$\hat\beta_j - \hat\beta_{j(-i)} = \frac{r_{ji}\,\hat e_i}{1-h_{ii}}, \tag{6.66}$$

where $r_{ji}$ is the ji-th element of $R = (X^TX)^{-1}X^T$.

One generally considers the i-th observation influential on the j-th coefficient $\beta_j$ if the quantity in (6.66) is considered to be "large." Since $\hat\beta_j$ is a random variable, large should be measured relative to the standard error $\sigma_j$ of $\hat\beta_j$, which is $\sigma\sqrt{\delta_j}$, where $\delta_j$ is the j-th diagonal element of $(X^TX)^{-1}$. If we estimate $\sigma\sqrt{\delta_j}$ by $\hat\sigma_{(-i)}\sqrt{\delta_j}$, then dividing (6.66) by this leads to the influence statistic DFBETAS$_{j,i}$:

$$\mathrm{DFBETAS}_{j,i} = \frac{\hat\beta_j - \hat\beta_{j(-i)}}{\hat\sigma_{(-i)}\sqrt{\delta_j}} = \frac{r_{ji}\,\hat e_i}{\hat\sigma_{(-i)}\sqrt{\delta_j}\,(1-h_{ii})}, \tag{6.67}$$

which gives the number of standard errors that the coefficient changes if the i-th observation were set aside. A large value (in magnitude) of DFBETAS$_{j,i}$ indicates that the i-th observation has a sizable impact on the j-th regression coefficient. The sign of DFBETAS$_{j,i}$ may also be meaningful. For example, if DFBETAS$_{j,i}$ in (6.67) is negative and relatively large in magnitude, it is likely that the negative coefficient can be attributed to the i-th observation.

Furthermore,

$$RR^T = (X^TX)^{-1}X^TX(X^TX)^{-1} = (X^TX)^{-1} = \Delta \ \text{(say)}. \tag{6.68}$$

Therefore, $\delta_j = r_j^Tr_j = \sum_{k=1}^{n}r_{jk}^2$, where $r_j$ denotes the j-th row of $R$. Using this, (6.67) can be written

$$\mathrm{DFBETAS}_{j,i} = \frac{r_{ji}}{\sqrt{r_j^Tr_j}}\cdot\frac{t_i}{\sqrt{1-h_{ii}}}, \tag{6.69}$$

where $t_i$ is the RSTUDENT. Again, DFBETAS$_{j,i}$ measures both leverage $h_{ii}$ and the effect (errors) of a large residual.

In [8] Belsley, Kuh and Welsch suggest a cut-off value of $2/\sqrt{n}$ for DFBETAS$_{j,i}$. If this quantity is exceeded then observation i is considered to be influential on the estimation of $\beta_j$. Since DFBETAS$_{j,i}$ is the ji-th element of an $(m+1) \times n$ matrix, this can lead to a substantial amount of output to examine even for moderate values of $(m, n)$.
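
As an illustration of (6.69), the following sketch computes the full $(m+1) \times n$ DFBETAS matrix from a single fit and flags entries above the $2/\sqrt{n}$ cut-off. It is written in Python with NumPy on synthetic data; the function name `dfbetas` and the data are assumptions made for the example, not part of the text.

```python
import numpy as np

def dfbetas(X, y):
    """Return the (m+1) x n matrix of DFBETAS_{j,i} using (6.69)."""
    n, p = X.shape                              # p = m + 1
    R = np.linalg.inv(X.T @ X) @ X.T            # R = (X^T X)^{-1} X^T
    h = np.sum(X.T * R, axis=0)                 # h_ii = diagonal of X R
    e = y - X @ (R @ y)                         # residuals
    s2 = e @ e / (n - p)
    # deleted variance estimate, formula (6.61)
    sigma_del = np.sqrt(((1 - h) * (n - p) * s2 - e**2) / ((1 - h) * (n - p - 1)))
    t = e / (sigma_del * np.sqrt(1 - h))        # RSTUDENT, formula (6.63)
    row_norm = np.sqrt(np.sum(R**2, axis=1))    # sqrt(r_j^T r_j) = sqrt(delta_j)
    return (R / row_norm[:, None]) * (t / np.sqrt(1 - h))[None, :]

rng = np.random.default_rng(1)                  # synthetic data, illustrative only
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

D = dfbetas(X, y)
flagged = np.argwhere(np.abs(D) > 2 / np.sqrt(n))   # (coefficient j, observation i) pairs
print(D.shape, len(flagged))
```

Each column of `D` corresponds to one observation, so scanning columns for large entries shows which coefficients a given point affects.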

To mitigate this problem it is somewhat easier to examine the effect of the i-th observation by considering its effect on the whole coefficient vector $\hat\beta$. This can be done by using some norm of the vector $\hat\beta - \hat\beta_{(-i)}$, as suggested by Cook in [18, 19]. The general form of this statistic is

$$D_i(M, c) = \frac{\left(\hat\beta - \hat\beta_{(-i)}\right)^T M \left(\hat\beta - \hat\beta_{(-i)}\right)}{c}, \tag{6.70}$$

where $M$ is a given positive definite matrix and $c$ is a normalizing constant. The most popular choice is $M = X^TX$ and $c = (m+1)s^2$, where $s^2$ is the usual unbiased estimate of variance. In this case

$$D_i = \frac{\left(\hat\beta - \hat\beta_{(-i)}\right)^T (X^TX) \left(\hat\beta - \hat\beta_{(-i)}\right)}{(m+1)\,s^2}, \tag{6.71}$$

which is called Cook's Distance. In computer output it is usually referred to as Cook's D. For computational purposes, using (6.47),

$$\left(\hat\beta - \hat\beta_{(-i)}\right)^T (X^TX) \left(\hat\beta - \hat\beta_{(-i)}\right) = \frac{\hat e_i^2}{(1-h_{ii})^2}\,x_i(X^TX)^{-1}(X^TX)(X^TX)^{-1}x_i^T = \frac{\hat e_i^2\,h_{ii}}{(1-h_{ii})^2}. \tag{6.72}$$

Hence

$$D_i = \frac{\hat e_i^2\,h_{ii}}{(1-h_{ii})^2\,(m+1)\,s^2} \tag{6.73}$$

$$\phantom{D_i} = \frac{r_i^2}{m+1}\left(\frac{h_{ii}}{1-h_{ii}}\right), \tag{6.74}$$

where $r_i = \hat e_i/(s\sqrt{1-h_{ii}})$ is the studentized residual. Hence, no new calculations need to be done, since one generally will have computed $r_i$ and $h_{ii}$ in the course of the analysis.

Also, note that $D_i$ combines the effect of fit through the residual $r_i$ and leverage through $h_{ii}$. Since $h_{ii}/(1-h_{ii})$ is an increasing function of $h_{ii}$, $D_i$ can be large if $r_i$ and/or $h_{ii}$ is large (i.e., close to one). Again, this points out the need to isolate those points with either large residuals (outliers) or high leverage.
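
A short sketch of how (6.74) can be evaluated from a single fit is given below, in Python with NumPy. The function name `cooks_distance` and the synthetic data are illustrative assumptions. The last lines compare one value with the defining quadratic form (6.71), obtained by actually deleting the observation.

```python
import numpy as np

def cooks_distance(X, y):
    """Cook's D for every observation via (6.74); X includes the intercept column."""
    n, p = X.shape                               # p = m + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # leverages h_ii
    e = y - X @ (XtX_inv @ X.T @ y)              # residuals
    s2 = e @ e / (n - p)                         # unbiased variance estimate
    r = e / np.sqrt(s2 * (1 - h))                # studentized residuals
    return r**2 / p * h / (1 - h)

rng = np.random.default_rng(4)                   # synthetic data, illustrative only
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([3.0, 1.0, 0.5]) + rng.normal(size=n)
D = cooks_distance(X, y)

# Direct evaluation of (6.71) for one observation, by explicit deletion
i = 5
keep = np.arange(n) != i
b_full = np.linalg.lstsq(X, y, rcond=None)[0]
b_del = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
s2 = np.sum((y - X @ b_full) ** 2) / (n - X.shape[1])
diff = b_full - b_del
D_direct = diff @ (X.T @ X) @ diff / (X.shape[1] * s2)
print(np.isclose(D[i], D_direct))
```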

As before, the question arises of what cut-off value should be used to decide that $D_i$ is large.

By comparing the form of $D_i$ to that for the F-statistic used to compute the joint confidence region (5.282) for $\beta$, it is suggested that values of $D_i > 1$ are to be considered large. Hence, we consider every point with $D_i > 1$ to be influential on the estimate $\hat\beta$, and such points should be set aside for further examination.

Finally we note that $D_i$ can be written in the alternative form

$$D_i = \frac{\left(\hat y - \hat y_{(-i)}\right)^T\left(\hat y - \hat y_{(-i)}\right)}{(m+1)\,s^2}, \tag{6.75}$$

where $\hat y_{(-i)} = X\hat\beta_{(-i)}$, which follows easily from (6.71). We leave the details to the reader. Hence, $D_i$ may also be considered as a statistic to measure the influence of the i-th observation on the overall fit.

One may also consider the influence of the i-th observation on the individual fitted values $\hat y_i$, $1 \le i \le n$. For this we define

$$\mathrm{DFFITS}_i = \frac{\hat y_i - \hat y_{i(-i)}}{\hat\sigma_{(-i)}\sqrt{h_{ii}}}, \tag{6.76}$$

where the denominator is an estimate of $\sigma(\hat y_i)$.
Using an argument similar to that for $D_i$ we find that

$$\mathrm{DFFITS}_i = t_i\left(\frac{h_{ii}}{1-h_{ii}}\right)^{1/2}. \tag{6.77}$$

Again, we see that both fit and leverage determine the influence of the i-th observation on the fit. If $t_i$ is large and/or $h_{ii}$ is large (close to one) then DFFITS$_i$ will be large. However, a large residual $t_i$ (outlier) can be masked if $h_{ii}$ is close to 1.

In [8] Belsley, Kuh and Welsch suggest a cut-off value of $2\sqrt{(m+1)/n}$ for DFFITS$_i$ to be large. Other influence statistics are discussed in [8, 87, 20]. We now examine a couple of examples to see how one can use these diagnostics in practice.
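
Before turning to the examples, the following sketch shows one way (6.77) could be computed from a single fit, together with the $2\sqrt{(m+1)/n}$ cut-off. It is written in Python with NumPy; the function name `dffits` and the synthetic data are assumptions for illustration only.

```python
import numpy as np

def dffits(X, y):
    """DFFITS_i = t_i * sqrt(h_ii / (1 - h_ii)), formula (6.77)."""
    n, p = X.shape                               # p = m + 1
    XtX_inv = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)  # leverages h_ii
    e = y - X @ (XtX_inv @ X.T @ y)              # residuals
    s2 = e @ e / (n - p)
    # deleted variance estimate (6.61), then RSTUDENT (6.63)
    sigma2_del = ((1 - h) * (n - p) * s2 - e**2) / ((1 - h) * (n - p - 1))
    t = e / np.sqrt(sigma2_del * (1 - h))
    return t * np.sqrt(h / (1 - h)), h

rng = np.random.default_rng(5)                   # synthetic data, illustrative only
n = 25
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(size=n)

values, h = dffits(X, y)
cutoff = 2 * np.sqrt(X.shape[1] / n)
influential = np.flatnonzero(np.abs(values) > cutoff)   # 0-based observation indices
print(cutoff, influential)
```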


Example 6.4 (Drink delivery data) Consider again the drink delivery data of Table 5.3. Table 6.5 shows the leverage $h_{ii}$, Cook's D, and DFFITS values, illustrating the influence of all 25 observations without removing any of them.


Table 6.5 $h_{ii}$, $D_i$ and DFFITS for the data in Table 5.3

No.   h_ii      D_i       DFFITS
 1    0.10180   0.10009
 2    0.07070   0.00338    0.0987
 3    0.09874   0.00001    0.0052
 4    0.08538   0.07765    0.5008
 5    0.07501   0.00054   -0.0395
 6    0.04287   0.00012    0.0188
 7    0.08180   0.00217    0.0790
 8    0.06373   0.00305    0.0938
 9    0.49829   3.41932    4.2961
10    0.19630   0.05385    0.3987
11    0.08613   0.01620    0.2180
12    0.11366   0.00160   -0.0677
13    0.06113   0.00229
14              0.00329    0.0974
15    0.04111   0.00063    0.0426
16    0.16594   0.00329    0.0972
17    0.05943   0.00040    0.0339
18    0.09626   0.04398    0.3653
19    0.09645   0.01192    0.1862
20    0.10169   0.13244   -0.6718
21    0.16528   0.05086   -0.3885
22    0.39158   0.45105   -1.1950
23    0.04126   0.02990   -0.3075
24    0.12061   0.10232   -0.5711
25    0.06664   0.00011   -0.0176

First, in order to consider the leverages, we calculate the cut-off leverage point, which is $2(m+1)/n = 2(3)/25 = 0.24$. Based on this criterion, observations 9 and 22 are high leverage points. Since the diagonal of the hat matrix provides a measure of standardized distance from the point $x_i$ to $\bar x$ (and the reader should note that $h_{ii}$ does not involve the y's), these two observations can exert undue influence on at least one regression coefficient as well as other performance criteria.

Now consider Cook's Distance, which considers both the location of the point in the x-space and the response variable in measuring influence. From Table 6.5, $D_9 = 3.42$ is the largest value, which indicates that observation 9 is definitely influential.

From (6.77), the cut-off value for DFFITS$_i$ for the drink delivery data is $2\sqrt{(2+1)/25} = 0.6928$. Inspecting Table 6.5 we notice that observations 9 and 22 have values of DFFITS that exceed this value, and also $|\mathrm{DFFITS}_{20}| = 0.6718$ is close to the cut-off value.

The DFBETAS$_{j,i}$ ($j = 0, 1, 2$) values for all 25 observations were calculated and are given in Table 6.6. The cut-off value is $2/\sqrt{25} = 0.4$. We then instantly notice that observations 9 and 22 have large effects on the regression coefficients as well as the quality of fit. As we see, observation 9 has a very large effect on the intercept and relatively smaller effects on $\hat\beta_1$ and $\hat\beta_2$, while observation 22 has its largest effect on $\hat\beta_1$. Besides these, several other observations produce effects on the coefficients that are close to the formal cut-off, including observations 1 (on $\hat\beta_1$ and $\hat\beta_2$), 4 (on $\hat\beta_0$), and 24 (on $\hat\beta_1$ and $\hat\beta_2$). These points produce relatively small changes in comparison to observation 9.

From the diagnostic point of view, clearly the influence of observation 9 is evident, since its deletion results in a displacement of every regression coefficient by at least 0.9 standard deviations. The impact of observation 22 is much smaller. Furthermore, deleting observation 9 displaces the predicted response by over four standard deviations.


Table 6.6 DFBETAS$_{j,i}$ values for the data in Table 5.3

No.   Intercept (j = 0)   Cases (j = 1)   Distance (j = 2)
 1        -0.1873             0.4113          -0.4349
 2         0.0898            -0.0478           0.0144
 3        -0.0035             0.0039          -0.0028
 4         0.4520             0.0883          -0.2734
 5        -0.0317            -0.0133           0.0242
 6        -0.0147             0.0018           0.0011
 7         0.0781            -0.0223          -0.0110
 8         0.0712             0.0334          -0.0538
 9        -2.5757             0.9287           1.5076
10         0.1079            -0.3382           0.3413
11        -0.0343             0.0925          -0.0027
12        -0.0303            -0.0487           0.0540
13         0.0724            -0.0356           0.0113
14                           -0.0671           0.0618
15                           -0.0048           0.0068
16                            0.0644          -0.0842
17                            0.0065          -0.0157
18                            0.1897          -0.2724
19                            0.0236          -0.0990
20                           -0.2150          -0.0929
21                           -0.2972           0.3364
22                           -1.0254           0.5731
23                            0.0373          -0.0527
24                            0.4046          -0.4654
25                            0.0008           0.0056

The intercept entries for observations 14 to 25 are missing here, apart from a single value, -0.0168, whose row is not identified.


Example 6.5 (Housing data) Consider the housing price data. There are two regressors, and the data were used to illustrate the influence diagnostics. Table 6.7 shows residuals and some of those essential diagnostic statistics.


Table 6.7 Residuals and Diagnostic Statistics for Housing Data

No.   Res e_i     T Res t_i   h_ii      Cook's D    DFFITS
 1      -572.1    -0.07582    0.13335   0.000321   -0.02974
 2      8876.8     1.23772    0.10868   0.059620    0.43219
 3     -9137.4    -1.36422    0.20219   0.146691   -0.68677
 4     -3337.5    -0.44779    0.13963   0.011621   -0.1804
 5     -4858.5    -0.73303    0.29917   0.079524   -0.4790
 6    -15795.7    -2.94203    0.21654   0.486842   -1.5467
 7     -4067.9    -0.57252    0.20924   0.030626   -0.2945
 8     11025.1     1.79205    0.25621   0.311363    1.0518
 9      4383.9     0.59898    0.15862   0.023819    0.2601
10      4227.7     0.56893    0.13536   0.017900    0.2251
11      4141.5     0.62072    0.29917   0.057784    0.4056
12     -4543.4    -0.64612    0.21935   0.041095   -0.3425
13      6891.1     1.23240    0.45877   0.411341    1.1346
14     -1068.6    -0.13753    0.08021   0.000599   -0.0406
15      3834.9     0.49962    0.08353   0.008090    0.1508

We first calculate the cut-off leverage point, which is $2(m+1)/n = 2(3)/15 = 0.4$. Data point 13 is closest to the cut-off leverage point; with $h_{13,13} = 0.45877$ it is the only observation that exceeds it. The cut-off value for DFFITS$_i$ is $2\sqrt{(2+1)/15} = 0.8944$. Data points 6, 8 and 13 have larger values (in magnitude) than the cut-off value. These are considered to be influential.

The DFBETAS$_{j,i}$ ($j = 0, 1, 2$) values for all 15 observations can also be calculated using either (6.67) or (6.69). We leave these to the reader as an exercise. The cut-off value is $2/\sqrt{15} = 0.5164$.
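
The arithmetic in this example can be reproduced with a few lines of Python. The sketch below recomputes the three cut-offs and rebuilds DFFITS and Cook's D for observations 6, 8 and 13 from the $t_i$ and $h_{ii}$ columns of Table 6.7. For Cook's D it relies on the standard identity $r_i^2 = t_i^2(n-p)/(n-p-1+t_i^2)$, with $p = m+1$, linking the studentized residual $r_i$ to RSTUDENT $t_i$; that identity is brought in for this check and is not stated in the text above. The helper names are hypothetical.

```python
import math

n, m = 15, 2                      # housing data: 15 observations, 2 regressors
p = m + 1

leverage_cut = 2 * p / n          # 2(m+1)/n
dffits_cut = 2 * math.sqrt(p / n) # 2*sqrt((m+1)/n)
dfbetas_cut = 2 / math.sqrt(n)    # 2/sqrt(n)

def dffits_from(t, h):
    """DFFITS from RSTUDENT t and leverage h, formula (6.77)."""
    return t * math.sqrt(h / (1 - h))

def cooks_d_from(t, h):
    """Cook's D from RSTUDENT t and leverage h, via r^2 and formula (6.74)."""
    r2 = t**2 * (n - p) / (n - p - 1 + t**2)
    return r2 / p * h / (1 - h)

# (t_i, h_ii) pairs copied from Table 6.7
rows = {6: (-2.94203, 0.21654), 8: (1.79205, 0.25621), 13: (1.23240, 0.45877)}
for obs, (t, h) in rows.items():
    print(obs, round(dffits_from(t, h), 4), round(cooks_d_from(t, h), 5))
```

The recomputed values agree with the DFFITS and Cook's D columns of Table 6.7 to the printed precision, which is a useful consistency check on a hand-entered table.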


6.5 Transformations 


As for simple linear regression, if examination of the model using the techniques of 
Chapter 5 and this (or others) indicates that modifications should be made, then various 
transformations may be required to provide a more adequate representation of the data. 
This can require transforming the data in y or X, adding or deleting variables and/or 
observations, changing the error structure and/or changing the method of estimation. In 
this section we focus on transforming the existing data and modifying least squares to 
account for violations in the basic assumption of the GLM. 

In Chapter 8 we will consider further modeling issues concerning the addition of new 
variables while in Chapter 9, we will consider modifications to deal with multicollinearity. 
As there will be some overlap with our discussion in Chapter 3, we will refer there for details as necessary.


6.5.1 Transformations in x 


If various diagnostics or theoretical considerations indicate that the dependence of $y$ on $x$ is not linear in one or more of the independent variables, then it may be appropriate to reformulate the model by transforming the variables in some fashion, as indicated for SLR in Chapter 3. For example, suppose we suspect nonlinearity in the j-th variable $x_j$ in (5.1); then there are a number of ways to proceed.

One can replace $x_j$ by a variable $z_j = f(x_j)$ so the model (5.1) becomes

$$Y = \beta_0 + \beta_1x_1 + \cdots + \beta_{j-1}x_{j-1} + \beta_jz_j + \cdots + \beta_mx_m + \varepsilon. \tag{6.78}$$

If the functional form $z_j = f(x_j)$ is known, then (6.78) is a linear model and can be fit and analyzed as we have already discussed. If this transformation is appropriate over our initial assumption, then this should show up in improved fit statistics, such as $R^2$, t values, increased F and reduced curvature of residual plots involving $z_j$ over those involving $x_j$. Unfortunately, the functional form $z_j = f(x_j)$ is usually unknown, so an analysis generally has to include methods for determining it. A variety of exploratory methods can be found in [115]. A reasonable approach is to parametrize the unknown function in some fashion and then estimate these parameters along with $\beta_i$, $0 \le i \le m$, $i \ne j$. A typical parameterization is $z_j = x_j^{\lambda}$, where $\lambda$ is an appropriate real number. If $x_j$ is positive, then all values of $\lambda$ are permissible, while $\lambda$ has to be restricted if $x_j$ takes on negative values. A typical range of $\lambda$ is $-2 \le \lambda \le 2$.


Since polynomials of suitable degree can be used to approximate fairly arbitrary functions, one might consider looking for $z_j$ as a polynomial in $x_j$, i.e.,

$$z_j = \sum_{k=1}^{l}\gamma_kx_j^k, \tag{6.79}$$

where $\gamma_k$, $1 \le k \le l$, have to be estimated. Using this in (6.78) the model becomes

$$Y = \beta_0 + \beta_1x_1 + \cdots + \beta_{j-1}x_{j-1} + \beta_j\sum_{k=1}^{l}\gamma_kx_j^k + \cdots + \beta_mx_m + \varepsilon. \tag{6.80}$$

Unfortunately this is not a linear model in the parameters $\beta_i$, $0 \le i \le m$, $i \ne j$, and $\gamma_k$, $1 \le k \le l$. Since there are a number of pitfalls in using polynomials, we will reserve our treatment of this until the following chapter. Other functional forms may be used as well, such as trigonometric functions and piecewise polynomials (splines). Again we will take this up in Chapter 7. For now we concentrate on the power family.


6.5.2 The Box-Tidwell Method 


One approach to estimating $\lambda$ in $z_j = x_j^{\lambda}$ is to choose a range of $\lambda$, fit the model for each value of $\lambda$ and then choose the model which provides the "best" fit, say the one with the smallest SSE or largest $R^2$ or F. For a large range of $\lambda$, this may require a substantial amount of computation and/or analysis. In addition, because the behavior of test statistics as a function of $\lambda$ is not known, this procedure could miss the best value if it was not in the original parameter range. A more automatic procedure, analogous to the Box-Cox approach, is to try to choose $\lambda$ in some automatic fashion. We discuss one such procedure here: the Box-Tidwell method [12].

Suppose we assume that $\lambda$ is not too different from $\lambda = 1$. Then expanding $x^{\lambda}$ in a Taylor series about $\lambda = 1$ gives

$$x^{\lambda} \approx x + (\lambda - 1)\left.\frac{dx^{\lambda}}{d\lambda}\right|_{\lambda=1}. \tag{6.81}$$

From calculus,

$$\frac{dx^{\lambda}}{d\lambda} = x^{\lambda}\log x, \tag{6.82}$$

so that $\left.dx^{\lambda}/d\lambda\right|_{\lambda=1} = x\log x$.
Thus,

$$x^{\lambda} \approx x + (\lambda - 1)\,x\log x, \tag{6.83}$$

and using this approximation in (5.1) the assumed model is of the form

$$
\begin{aligned}
Y &= \beta_0 + \beta_1x_1 + \cdots + \beta_{j-1}x_{j-1} + \beta_j\left[x_j + (\lambda-1)\,x_j\log x_j\right] + \cdots + \beta_mx_m + \varepsilon \\
&= \beta_0 + \sum_{k=1}^{m}\beta_kx_k + \beta_j(\lambda-1)\,x_j\log x_j + \varepsilon.
\end{aligned}
\tag{6.84}
$$

Letting $\beta_{m+1}(\lambda) = \beta_j(\lambda - 1)$, (6.84) is a linear model in the parameters $\beta_k$, $0 \le k \le m+1$. Since $\beta_{m+1} = (\lambda - 1)\beta_j$, we estimate $(\lambda, \beta_j)$ in the following way. First, fit the original linear model and obtain the least squares estimate $\hat\beta_j$ of $\beta_j$. Then fit the enlarged model with $x_{m+1} = x_j\log x_j$ by least squares and obtain the least squares estimate of $\beta_{m+1}$. Using the relation $\beta_{m+1} = (\lambda - 1)\beta_j$ gives

$$\hat\lambda = \left(\hat\beta_{m+1}/\hat\beta_j\right) + 1, \tag{6.85}$$

where $\hat\beta_{m+1}$ is the least squares estimate of $\beta_{m+1}$. This allows one to test for the need for a transformation:

$$H_0: \lambda = 1 \quad\text{against}\quad H_1: \lambda \ne 1 \tag{6.86}$$


by using the t test to test for $\beta_{m+1} = 0$. In addition, a $100(1-\alpha)\%$ CI for $\lambda$ is given by

$$\left(\hat\beta_{m+1}/\hat\beta_j + 1\right) \pm t_{\alpha,\,n-m-2}\cdot s, \tag{6.87}$$

where $s = \sqrt{SSE/(n-m-2)}$.

In Atkinson's modification of this procedure, one applies Andrews' approach to the transformed data $z^{(\lambda)}$ in (6.93). Further details can be found in [20, 27].

Again, the new model (6.84) with $\hat\lambda$ given by (6.85) can be analyzed as before. As indicated in [20] it is useful to examine the added variable plot using the constructed variable $x_j\log x_j$ as a diagnostic for a transformation. A distinct trend in the plot suggests that $\lambda \ne 1$ in (6.84).

If the model (6.84) with $\hat\lambda$ appears inadequate, one can obtain successive estimates $\hat\lambda(l)$, $l \ge 1$, by letting $\hat\lambda(0) = \hat\lambda$ in (6.85) and then getting a new estimate by expanding $x_j^{\lambda}$ about $\lambda = \hat\lambda(0)$, giving $x_j^{\lambda} \approx x_j^{\hat\lambda(0)} + \left(\lambda - \hat\lambda(0)\right)x_j^{\hat\lambda(0)}\log x_j$, and substituting into (6.84) to give

$$Y = \beta_0 + \sum_{k=1,\,k\ne j}^{m}\beta_kx_k + \beta_jx_j^{\hat\lambda(0)} + \beta_j\left[\lambda - \hat\lambda(0)\right]x_j^{\hat\lambda(0)}\log x_j + \varepsilon. \tag{6.88}$$

Then, fit the model (6.88) with and without the added variable $x_{m+1} = x_j^{\hat\lambda(0)}\log x_j$. If $\hat\beta_j(1)$ and $\hat\beta_{m+1}(1)$ are the least squares estimates of $\beta_j$ and $\beta_{m+1}$, then

$$\hat\lambda(1) = \hat\lambda(0) + \hat\beta_{m+1}(1)/\hat\beta_j(1). \tag{6.89}$$

Again, one can further iterate if the method is converging, or stop after a fixed number of iterations if not.

This method can be easily generalized to account for simultaneous transformations in two or more predictors and to consider transformations from $\lambda = \lambda_0$ rather than $\lambda = 1$. (Just use (6.88).)
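
The iteration described above can be sketched in a few lines. The following Python/NumPy code is one possible reading of the procedure, under the assumption that each step adds the ratio of the constructed-variable coefficient to the coefficient of $x_j^{\hat\lambda}$ from the model without the constructed variable. The function names `ols` and `box_tidwell`, the stopping rule and the synthetic data are illustrative and not taken from the text.

```python
import numpy as np

def ols(X, y):
    """Least squares coefficients for y ~ X (X already contains the intercept)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def box_tidwell(x_other, xj, y, n_iter=20, tol=1e-8):
    """Iterate lambda <- lambda + b_{m+1} / b_j for a positive predictor xj."""
    lam = 1.0
    ones = np.ones_like(y)
    for _ in range(n_iter):
        z = xj ** lam
        base = np.column_stack([ones, x_other, z])
        b_j = ols(base, y)[-1]                          # coefficient of xj**lam
        aug = np.column_stack([base, z * np.log(xj)])   # add constructed variable
        b_add = ols(aug, y)[-1]                         # coefficient of z*log(xj)
        step = b_add / b_j
        lam += step
        if abs(step) < tol:
            break
    return lam

rng = np.random.default_rng(2)            # synthetic data with true power 1/2
n = 200
x1 = rng.uniform(0, 5, n)
x2 = rng.uniform(1, 10, n)
y = 1.0 + 2.0 * x1 + 3.0 * np.sqrt(x2) + rng.normal(scale=0.01, size=n)

lam_hat = box_tidwell(x1, x2, y)
print(round(lam_hat, 3))
```

The first pass through the loop is exactly (6.85); later passes correspond to (6.88) and (6.89). The sketch assumes every value of the transformed predictor is strictly positive so that the logarithm is defined.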


Example 6.6 Consider the data set given below. Suppose that we have a multiple regression problem with $m = 2$ predictors and we contemplate transforming one of them, say $x_2$. We apply the Box-Tidwell method to these data and examine the need to transform $x_2$ by fitting

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \varepsilon.$$

The overall F-statistic is 1621.74. Clearly, the model is significant at 1%, and the results are $\hat\beta_2 = 0.32071$, its standard error is 0.02919, and the t-value is 10.99.
We now consider an enlarged model

$$y = \beta_0 + \beta_1x_1 + \beta_2x_2 + \beta_2(\lambda - 1)\,x_2\log x_2 + \varepsilon.$$

Then, the estimate of $\beta_2(\lambda - 1)$ is $-0.1593$, its standard error is 0.09812, and the t-value is $-1.62$. (To provide enough evidence of the need to transform the predictor, the t-value is compared to $t(\nu)$; for this example $\nu = 10$.)

Hence,

$$\hat\lambda = \frac{-0.1593}{0.32071} + 1 = 0.5033.$$

Rounding to a convenient multiple, we choose $\lambda \approx 1/2$ and take the square root of $x_2$. Thus, the fitted model is given by

$$y_i = \beta_0 + \beta_1x_1 + \beta_2\sqrt{x_2} + \varepsilon_i,$$

so the estimated regression surface is

$$\hat y = 0.984 + 1.072\,x_1 + 0.951\sqrt{x_2}.$$


6.5.3 Transformations of y 


As for SLR, transformation of $y$ in (5.1) is suggested by curvature in residual plots, in particular those of $\hat e$ against $\hat y$. In general, transformations in $y$ result from the following factors:


(i) the relation between $Y_x$ and $x$ is not linear;
(ii) the variance of $Y_x$ is not constant;
(iii) the error model for $Y_x$ is incorrect.


Often (ii) is a consequence of (iii). For example, in many situations the dependent variable $Y_x$ is discrete, rather than continuous as implied by the assumption of normal errors. A typical, and increasingly important, case is that of logistic regression. Here, $Y_x$ may only take on the values 0 and 1, to record the success or failure of an experiment. In this case $Y_x$ is a Bernoulli random variable, with $E(Y_x) = p_x = P(Y_x = 1)$.
Since $\mathrm{Var}(Y_x) = p_x(1 - p_x)$, the variance of $Y_x$ will generally not be constant if $p_x$ depends in a nontrivial way on the independent variable $x$. Many other such situations occur in practice, such as count data, which typically follow a Poisson distribution [40], or the analysis of survival times, when the error distribution may be exponential.

In this section we will consider transformations which deal with (i)-(ii), when data can be transformed to the normal model, and in Chapter 7 we will consider some aspects of (iii).


6.5.4  Linearizable Transformations 


As for SLR, a common approach to transforming $Y_x$ is to assume that there exists a transformation which linearizes the model. A typical case, as in Chapter 3, is the assumption that

$$y = \exp\left(\sum_{j=0}^{m}\beta_jx_j\right), \tag{6.90}$$

so that

$$\log y = \sum_{j=0}^{m}\beta_jx_j. \tag{6.91}$$

Of course, as we pointed out in Section 3.10, taking logarithms in (6.90) will not necessarily transform the model to one with $N(0, \sigma^2)$ errors, unless the distribution of $Y$ is lognormal. Otherwise, transforming $Y$ transforms the model to one with a complicated error structure. In such cases, other estimation techniques, such as nonlinear regression, may be needed to fit the original data.

If one refers to Table 3.24, any functional form given there can be used to linearize a nonlinear model, but the same remarks concerning the error structure should be accounted for. If such a transform is suggested, one can fit the transformed model and then evaluate its appropriateness by the various plots and diagnostics discussed in this chapter.


6.5.5 Box-Cox Transformations 


As in Section 3.10, if a linearizing transformation is not immediately apparent, then one can use the Box-Cox method to determine a transformation of power type, if the data $y_i$, $1 \le i \le n$, are positive. As there, we assume that

$$Y^{(\lambda)} = \begin{cases} \dfrac{Y^{\lambda} - 1}{\lambda}, & \lambda \ne 0, \\[2mm] \log Y, & \lambda = 0, \end{cases} \tag{6.92}$$

follows the linear model with $N(0, \sigma^2)$ errors. Then one can use MLE to estimate $\beta(\lambda)$, $\sigma(\lambda)$ and $\lambda$. Essentially repeating the calculations given in Section 3.10, this can be done by fixing $\lambda$ and then regressing the modified data

$$z_i^{(\lambda)} = \frac{y_i^{(\lambda)}}{\dot y^{\,\lambda-1}}, \tag{6.93}$$

where $\dot y = \left(\prod_{i=1}^{n}y_i\right)^{1/n}$ is the geometric mean of $y_i$, $1 \le i \le n$. That is, we use the model

$$z_i^{(\lambda)} = x_i\beta + \varepsilon_i, \tag{6.94}$$

where $\varepsilon_i$, $1 \le i \le n$, are $N(0, \sigma^2)$.
The MLE of $\sigma$, $\hat\sigma(\lambda)$, is given by

$$\hat\sigma(\lambda) = \sqrt{SSE(\lambda)/n}. \tag{6.95}$$

To obtain the MLE $\hat\lambda$ of $\lambda$, one can proceed as for SLR by choosing a range of values of $\lambda$ (often $-2 \le \lambda \le 2$ is a reasonable choice) and plotting $SSE(\lambda)$ against $\lambda$. The minimum value can then be estimated visually or by a more formal interpolation process.
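
The grid search over $\lambda$ can be sketched as follows, in Python with NumPy. The code assumes the usual geometric-mean normalization of the Box-Cox family, with $\dot y\log y$ used at $\lambda = 0$, so that $SSE(\lambda)$ is comparable across values of $\lambda$. The function name `boxcox_sse`, the grid spacing and the synthetic data are illustrative choices, not taken from the text.

```python
import numpy as np

def boxcox_sse(X, y, lam):
    """SSE from regressing the normalized Box-Cox transform of y on X."""
    gm = np.exp(np.mean(np.log(y)))               # geometric mean of y
    if abs(lam) < 1e-12:
        z = gm * np.log(y)                        # limiting case lambda = 0
    else:
        z = (y**lam - 1.0) / (lam * gm ** (lam - 1.0))
    beta, *_ = np.linalg.lstsq(X, z, rcond=None)
    r = z - X @ beta
    return r @ r

rng = np.random.default_rng(3)                    # synthetic data: log y is linear in x
n = 200
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = np.exp(0.5 + 0.3 * x + rng.normal(scale=0.05, size=n))

grid = np.round(np.arange(-2.0, 2.0001, 0.1), 1)  # lambda values in [-2, 2]
sse = np.array([boxcox_sse(X, y, lam) for lam in grid])
lam_hat = grid[np.argmin(sse)]                    # minimizer of SSE(lambda) on the grid
sigma_hat = np.sqrt(sse.min() / n)                # formula (6.95) at lam_hat
print(lam_hat, sigma_hat)
```

Plotting `sse` against `grid` gives the curve described in the text. A finer grid or an interpolation around the minimum would refine the estimate.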


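6.5.6 Quick Estimates of $\lambda$


Because the Box-Cox method can be computationally intensive, a number of authors have suggested "quick transformations" to estimate $\lambda$. We discuss one here due to Andrews [3], which is analogous to the Box-Tidwell method. A modification of this method was given by Atkinson in [6].

As in the Box-Cox method we begin by trying to determine a value of $\lambda$ such that $y^{(\lambda)}$ in (6.92) satisfies the conditions of the GLM. If $\lambda = 1$, then the data will not require transformation. To test this, we expand $y^{(\lambda)}$ in a Taylor series about $\lambda = 1$, giving

$$y^{(\lambda)} \approx y^{(1)} + (\lambda - 1)\left.\frac{dy^{(\lambda)}}{d\lambda}\right|_{\lambda=1}. \tag{6.96}$$

From calculus,

$$\frac{d}{d\lambda}\left(\frac{y^{\lambda} - 1}{\lambda}\right) = \frac{y^{\lambda}\log y}{\lambda} - \frac{y^{\lambda} - 1}{\lambda^2}, \tag{6.97}$$

so that

$$\left.\frac{dy^{(\lambda)}}{d\lambda}\right|_{\lambda=1} = y\log y - y + 1. \tag{6.98}$$

Hence, near $\lambda = 1$,

$$y^{(\lambda)} \approx y - 1 + (\lambda - 1)(y\log y - y + 1). \tag{6.99}$$

Using (6.99) and (6.96) we get an approximate model

$$Y - 1 + (\lambda - 1)(Y\log Y - Y + 1) = \beta_0 + \sum_{j=1}^{m}\beta_jx_j + \varepsilon, \tag{6.100}$$

and rewriting this gives

$$Y \approx 1 + \beta_0 + \sum_{j=1}^{m}\beta_jx_j + (1 - \lambda)(Y\log Y - Y + 1) + \varepsilon. \tag{6.101}$$

Letting

$$\beta_0^{*} = 1 + \beta_0, \quad \beta_{m+1} = 1 - \lambda \quad\text{and}\quad x_{m+1} = \hat y\log\hat y - \hat y + 1, \tag{6.102}$$

where $\hat y$ are the fitted values for the untransformed model ($\lambda = 1$), gives the further approximation

$$y = \beta_0^{*} + \sum_{j=1}^{m}\beta_jx_j + \beta_{m+1}x_{m+1} + \varepsilon. \tag{6.103}$$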


282 CHAPTER 6. RESIDUALS, DIAGNOSTICS AND TRANSFORMATIONS 


In this form, (6.103) is a standard linear model which can be fit by least squares. The estimate $\hat\beta_{m+1}$ can then be used to test for a need to transform the data. Since we are considering the null hypothesis that $\lambda = 1$, this is equivalent to testing $\beta_{m+1} = 0$ in (6.103). This can be done using the usual t-test.

Using this test, if we conclude that $\beta_{m+1} \ne 0$, then we can transform the data using the estimate $\hat\lambda$ of $\lambda$ given by

$$\hat\lambda = 1 - \hat\beta_{m+1} \tag{6.104}$$

and refit using least squares, noting any improvements.
To simplify matters, it was shown in [20] that the same t values for $\hat\beta_{m+1}$ are obtained if we use $x_{m+1} = \hat y\log\hat y$ rather than $x_{m+1} = \hat y\log\hat y - \hat y + 1$ in (6.103).
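
A minimal sketch of this test is given below in Python with NumPy. The function name `andrews_test`, its `simplified` switch and the synthetic data are assumptions made for illustration. The sketch also checks numerically that the two choices of constructed variable give the same coefficient and t value; this holds because $\hat y$ and the constant both lie in the column space of $X$ when the model includes an intercept.

```python
import numpy as np

def andrews_test(X, y, simplified=True):
    """Return (lambda_hat, t value) for the constructed variable in (6.103)."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    yhat = X @ beta                          # fitted values at lambda = 1 (must be > 0)
    c = yhat * np.log(yhat)                  # simplified constructed variable
    if not simplified:
        c = c - yhat + 1.0                   # full form from (6.102)
    Xa = np.column_stack([X, c])
    ba, *_ = np.linalg.lstsq(Xa, y, rcond=None)
    resid = y - Xa @ ba
    n, p = Xa.shape
    s2 = resid @ resid / (n - p)
    se = np.sqrt(s2 * np.linalg.inv(Xa.T @ Xa)[-1, -1])
    return 1.0 - ba[-1], ba[-1] / se         # (6.104) and the usual t statistic

rng = np.random.default_rng(6)               # synthetic data: sqrt(y) is linear in x
n = 60
x = rng.uniform(0, 10, n)
X = np.column_stack([np.ones(n), x])
y = (5.0 + 0.5 * x + rng.normal(scale=0.1, size=n)) ** 2

lam_s, t_s = andrews_test(X, y, simplified=True)
lam_f, t_f = andrews_test(X, y, simplified=False)
print(np.isclose(lam_s, lam_f), np.isclose(t_s, t_f))
```

Because (6.99) is only a first-order expansion about $\lambda = 1$, the value returned should be read as a quick indication of the direction and rough size of the transformation, not as the MLE.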


Example 6.7 Here we consider the need for a possible transformation of the housing 
price data in Example 5.12 and the birth weight data in Example 5.15. 

For the housing prices we observed that observation 6 was an outlier and for the birth 
weight data there is a substantial amount of unexplained residual variation. Although 
the various statistics and plots suggest that the fitted models (5.98) and (5.103) are 
appropriate, we entertain the possibility that a transformation might improve the fits. 

To examine this possibility we used the Andrews test for the added variable $x_3 = \hat y\log\hat y$. In both instances, our program estimated $\hat\beta_3 = 0.0$, so we conclude that a transformation in $y$ is not necessary.


Example 6.8 (Drink delivery data) As noted previously, although the linear model (5.99) appears to fit the data quite well, there are a number of problems that are unresolved. First, there are outliers (observations 9 and 22), and the distribution of the residuals, as shown in Figure 6.8, has a nonnormal appearance (in fact it looks roughly exponential). Hence, we consider the possibility of a transformation on $y$ to see if these discrepancies can be more readily accounted for.

For this we used the Andrews test by fitting $x_1$, $x_2$ and $x_3 = \hat y\log\hat y$, where $\hat y$ are the fitted values for the least squares fit to $y$ given in Table 6.2. In this case $\hat\beta_3 = 0.3939$ and $t_3 = 3.43$, which is significant at the < 1% level. Hence, we conclude that the data should be transformed before being fit by least squares. From (6.104) we used the estimate

$$\hat\lambda = 1 - 0.3939 = 0.6061.$$

The model

$$y^{(\hat\lambda)} = \beta_0 + \beta_1x_1 + \beta_2x_2 + \varepsilon$$

was used to fit the delivery times, and the ANOVA table and t statistics are shown in Tables 6.5 and 6.6.

From Table 6.5 we see that F is highly significant and both $\hat\beta_1$ and $\hat\beta_2$ are significantly different from zero. The fit is somewhat improved over the untransformed model.


Table 6.5 ANOVA table for transformed delivery data

Source       df   Sum of Squares   Mean Squares   F        p-value
Regression    2       353.38          176.69      323.08   0.000
Residual     22        12.03            0.55
Total        24       365.42

$R^2 = 0.967$, $\bar R^2 = 0.964$


Table 6.6 t statistics for transformed delivery model

Predictor   Coefficient   S.E. Coeff.   t-statistic   p-value
constant    3.7214        0.2488        14.96         0.000
x_1         0.40422       0.03874       10.44         0.000
x_2         0.0037095     0.0008198      4.53         0.000


From Table 6.6 the regression equation is

$$\hat y^{(\hat\lambda)} = 3.72 + 0.404\,x_1 + 0.00371\,x_2.$$

To examine model assumptions we investigate the residual plots in Figure 6.14. There appear to be distinct improvements over the untransformed data. The histogram is more symmetric and both observations 9 and 22 are no longer outliers. Overall, the transformed model appears to give a better representation of the data than our original assumptions in Example 5.14.

Figure 6.14: Plots of residuals for transformed drink delivery data


Example 6.9 (Tree data) Like the Longley and Hald data, the data shown in Table 6.7 have achieved a certain notoriety in the statistics literature [22, 87, 27]. The data are measurements taken on a sample of 31 black cherry trees in the Allegheny National Forest, Pennsylvania, USA. The values of D are the diameters of the trees taken at a height of 4.5 feet above ground level and H is the height of the measured trees. V is the volume of the trees in cubic feet. The data were collected to provide a way of estimating the amount of timber a tree yields using its height and diameter.

Table 6.7 Allegheny National Forest Tree data

No.  Diameter  Height  Volume  |  No.  Diameter  Height  Volume
 1     8.3      70     10.3    |  17    12.9      85     33.8
 2     8.6      65     10.3    |  18    13.3      86     27.4
 3     8.8      63     10.2    |  19    13.7      71     25.7
 4    10.5      72     16.4    |  20    13.8      64     24.9
 5    10.7      81     18.8    |  21    14.0      78     34.5
 6    10.8      83     19.7    |  22    14.2      80     31.7
 7    11.0      66     15.6    |  23    14.5      74     36.3
 8    11.0      75     18.2    |  24    16.0      72     38.3
 9    11.1      80     22.6    |  25    16.3      77     42.6
10    11.2      75     19.9    |  26    17.3      81     55.4
11    11.3      79     24.2    |  27    17.5      82     55.7
12    11.4      76     21.0    |  28    17.9      80     58.3
13    11.4      76     21.4    |  29    18.0      80     51.5
14    11.7      69     21.3    |  30    18.0      80     51.0
15    12.0      75     19.1    |  31    20.6      87     77.0
16    12.9      74     22.2    |


Figure 6.15: Scatter plot of volume (V) versus diameter (D)


To model the data we made scatter plots of V against H and V against D. These are
shown in Figures 6.15-6.16. Linear trends are seen in both cases so we begin our analysis
by fitting the linear model

$Y = \beta_0 + \beta_1 D + \beta_2 H + \varepsilon$ (6.105)

($D = x_1$, $H = x_2$) to the data. The ANOVA and t statistics are shown in Tables 6.8
and 6.9.

Figure 6.16: Scatter plot of volume (V) versus height (H)


Table 6.8 ANOVA table for tree data 


Source       df   Sum of Squares   Mean Squares   F        p-value
Regression   2    7684.2           3842.1         254.97   0.000
Residual     28   421.9            15.1
Total        30   8106.1

$R^2 = 0.948$    $R_A^2 = 0.944$


Table 6.9 t statistics for tree model 
Predictor   Coefficient   S.E. Coeff.   t-statistic   p-value   VIF
constant    -57.988       8.6380        -6.71         0.000
x1          4.7082        0.2643        17.82         0.000     1.4
x2          0.3393        0.1302        2.61          0.014     1.4


From these tables we see that the overall fit is highly significant ($P(F > 254.97) <
10^{-3}$) and the $t$ values show that $\beta_1$ and $\beta_2$ are significantly different from zero.

To further examine the validity of the model we computed the residuals and various
influence statistics which are shown in Table 6.10. Residual plots for $\hat{\varepsilon}_i$ are given in
Figure 6.17. Two features are apparent. First, observation 31 appears as an outlier and,
as measured by Cook’s D, is the most influential. This is not surprising since it is the
“largest” tree and observations at extreme points of the data are often the most influential.
Second, the residual plot in Figure 6.17 shows a rather distinct “bowl” shaped character
suggesting that a transformation might be useful.




Table 6.10 Residual $\hat{\varepsilon}_i$, Leverage $h_{ii}$, Cook's D, and DFFITS for Tree Data


Obs.   Residual   Leverage   Cook's D   DFFITS
1                            0.0978     0.555
2      1.652      0.1472     0.1479     0.686
3      1.568      0.1769     0.1673     0.727
4      0.137      0.0592     0.0004     0.034
5      -0.289     0.1207     0.0040     -0.107
6      -0.368     0.1558     0.0084     -0.156
7      -0.159     0.1148     0.0011     -0.057
8      -0.272     0.0515     0.0014     -0.063
9      0.318      0.0920     0.0035     0.100
10     -0.075     0.0480     0.0001     -0.017
11     0.578      0.0738     0.0091     0.163
12     -0.122     0.0481     0.0003     -0.027
13     -0.018     0.0481     0.0000     -0.004
14     0.209      0.0728     0.0012     0.058
15     -1.290     0.0377     0.0212     -0.255
16                           0.0271     -0.292
17                           0.0189
18     -1.860     0.1435     0.1775     -0.761
19     -1.324     0.0667     0.0407     -0.354
20     -1.106     0.2112     0.1083     -0.572
21     0.029      0.0358     0.0000     0.006
22     -1.142     0.0454     0.0205     -0.249
23     0.238      0.0500     0.0010     0.054
24     -0.946     0.1114     0.0376     -0.335
25     -0.601     0.0693     0.0092     -0.164
26     1.213      0.0884     0.0468     0.3778
27     0.940      0.0960     0.0314     0.306
28     1.847      0.1064     0.0700     0.465
29     -0.648     0.1098     0.0177     -0.228
30     -0.786     0.1098     0.0258     -0.276
31     2.766      0.2271     0.6052     1.499


Figure 6.17: Residual plots for tree data


To substantiate this we again used Andrews' test, which gave the values

$\hat{\beta}_3 = 0.52953$, $t_3 = 5.9$ and $p = 0.002$.

Hence, $\beta_3$ is significantly different from zero. Using it,

$\hat{\lambda} = 1 - \hat{\beta}_3 = 0.47047.$ (6.106)


Fitting the transformed model with the data transformed as $V^{(\lambda)}$ is left as an
exercise. Before completing our analysis we discuss some similar results given in [20] using
the Box-Cox method to estimate $\lambda$. For the model

$Y^{(\lambda)} = \beta_0 + \beta_1 D + \beta_2 H + \varepsilon$ (6.107)


it was found in [20] that the Box-Cox estimate $\hat{\lambda}$ of $\lambda$ is 0.307, similar to (6.106). On
the basis of dimensional considerations they transformed the data using $\lambda = 1/3$ and
compared the models
YO = f, + bD? + B,H +e (6.108) 


and

$Y = \beta_0 + \beta_1 D + \beta_2 D^2 + \beta_3 H + \varepsilon.$ (6.109)


Generally these models improved the fit over the untransformed model (6.105) but 
observation 31 was still an outlier (see Exercise 6.6). 

We now consider another approach to modeling the tree data which was suggested in
[20] but apparently not carried out. Roughly speaking, a tree (exclusive of branches and
leaves) has the appearance of a truncated cone. For a cone with base diameter $D$ and
height $H$, the volume $V$ is given by

$V = \pi D^2 H/12,$ (6.110)

so that $V$ is proportional to $D^2 H$. Hence, for a tree it seems reasonable to assume that

$V \approx c D^{\beta_1} H^{\beta_2}$ (6.111)

where $(c, \beta_1, \beta_2)$ are estimated from the data. To do this we take the logarithm of (6.111)
giving the model ($\beta_0 = \log c$)

$\log V = \beta_0 + \beta_1 \log D + \beta_2 \log H + \varepsilon.$ (6.112)

This model was fit using least squares and the results are shown in Tables 6.11 and 6.12.


Table 6.11 ANOVA table for log transformed tree data 


Source       df   Sum of Squares   Mean Squares   F        p-value
Regression   2    8.1232           4.0616         613.19   0.000
Residual     28   0.1855           0.0066
Total        30   8.3087


$R^2 = 0.978$    $R_A^2 = 0.976$


Table 6.12 t statistics for log transformed tree model 
Predictor   Coefficient   S.E. Coeff.   t-statistic   p-value   VIF
constant    -6.6316       0.7998        -8.29         0.000
x1          1.9827        0.0750        26.43         0.000     1.4
x2          1.1171        0.2044        5.46          0.000     1.4


From these tables we see that the overall fit is highly significant ($P\{F > 613.19\} <
10^{-3}$) and again $\beta_1$ and $\beta_2$ are significantly different from zero. Notice also that $\hat{\beta}_1 =
1.98265 \approx 2$ and $\hat{\beta}_2 = 1.1171$, which are close to the theoretical values $\beta_1 = 2$ and $\beta_2 = 1$
that would hold if (6.110) were exactly true. So the model not only fits the data well, but does
so in a way that corresponds to the “physics” of the situation. To further investigate the
validity of the model we calculated the residuals and various influence diagnostics which are
given in Table 6.13.

Finally, we give residual plots for $\hat{\varepsilon}_i$ in Figure 6.18. Notice that the histogram appears
quite normal and the curvature seen in the residual plot of Figure 6.17 is no longer present.
Moreover, the I Chart shows that observation 31 is no longer an outlier. Observations
15 and 18 appear to be outliers, but they are not influential. Overall, the model based on
(6.112) seems to be a good representation of the data and should be useful for prediction.


Table 6.13 Residual $\hat{\varepsilon}_i$ and Influence Diagnostics for Log-transformed Tree Data


Obs.   Residual   Leverage   Cook's D   DFFITS
1                            0.0051     0.121
2      0.455      0.1672     0.0143     0.204
3      0.187      0.1975     0.0030     0.093
4      -0.132     0.0589     0.0004     -0.033
5      -0.557     0.1214     0.0147     -0.207
6      -0.553     0.1521     0.0187     -0.234
7      -0.723     0.1194     0.0240     -0.266
8      -0.552     0.0506     0.0056     -0.128
9      1.061      0.0907     0.0373     0.335
10     0.114      0.0465     0.0002     0.025
11     1.703      0.0722     0.0704     0.475
12     0.163      0.0464     0.0005     0.036
13     0.397      0.0464     0.0026     0.088
14     1.072      0.0728     0.0299     0.300
15     -2.258     0.0356     0.0547     -0.434
16                           0.0414     -0.369
17
18     -2.326     0.1255     0.2236     -0.881
19     -0.931     0.0711     0.0222     -0.258
20     -0.046     0.2428     0.0002     -0.026
21     0.915      0.0375     0.0109     0.181
22     -0.848     0.0461     0.0117     -0.187
23     1.461      0.0539     0.0389     0.349
24     0.031      0.1145     0.0000     0.011
25     -0.038     0.0709     0.0000     -0.010
26     1.098      0.0855     0.0372     0.335
27     0.690      0.0916     0.0163     0.219
28     1.069      0.0982     0.0413     0.353
29     -0.676     0.1006     0.0174     -0.226
30     -0.804     0.1006     0.0244     -0.269
31     -0.155     0.1803     0.0018     -0.073


6.5.7 Variance Equalizing Transformations 


When model checking suggests that the error variances are not constant, then one should
consider transforming the data, to equalize (at least approximately) the variances of $\varepsilon_i$,
before using least squares. By definition, the Box-Cox and related transformations do
this by assuming that the transformed data have $N(0, \sigma^2)$ errors. If such a transformation
is not warranted, in particular, if the data are discrete, then other approaches are necessary.
We consider two approaches:


(i) weighted least squares; 


(ii) variance stabilizing transformations. 


Weighted Least Squares 


Here we assume that the errors in (5.1) are normal, but $\mathrm{Var}(\varepsilon_i) = \sigma_i^2$ depends on the
$i$-th observation. In particular, we assume that

$\sigma_i^2 = \sigma^2 / w_i,$ (6.113)

where $w_i > 0$, $1 \le i \le n$, are referred to as weights. Our assumption then is that the
model is of the form (5.1) except that $\varepsilon_i \sim N(0, \sigma^2/w_i)$, $1 \le i \le n$. If the weights are
known, then we can estimate $\beta_j$, $0 \le j \le m$, and $\sigma^2$ by MLE as follows.

Figure 6.18: Residual plots for log-transformed model of tree data

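Letting

$\mu_i = \sum_{j=0}^{m} \beta_j x_{ij}$ (6.114)

then the likelihood function of $Y_i$, $1 \le i \le n$, is given, up to a factor that depends only
on the known weights, by

$L = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} w_i (y_i - \mu_i)^2\right]$ (6.115)

and maximizing $L$ with respect to $\beta$ and $\sigma^2$ enables us to obtain $\hat{\beta}$ and $\hat{\sigma}^2$.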
To simplify the arithmetic, observe that

$w_i (y_i - \mu_i)^2 = (\sqrt{w_i}\,y_i - \sqrt{w_i}\,\mu_i)^2$ (6.116)

which gives

$w_i (y_i - \mu_i)^2 = (z_i - r_i)^2$ (6.117)

where $z_i = \sqrt{w_i}\,y_i$, $1 \le i \le n$, and

$r_i = \sum_{j=0}^{m} \beta_j \sqrt{w_i}\,x_{ij} = \sum_{j=0}^{m} \beta_j v_{ij}$ (6.118)

where $v_{ij} = \sqrt{w_i}\,x_{ij}$, $1 \le i \le n$, $0 \le j \le m$, then

$L = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\left[-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (z_i - r_i)^2\right].$ (6.119)
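Letting

$V = \begin{pmatrix} \sqrt{w_1} & \sqrt{w_1}x_{11} & \cdots & \sqrt{w_1}x_{1m} \\ \sqrt{w_2} & \sqrt{w_2}x_{21} & \cdots & \sqrt{w_2}x_{2m} \\ \vdots & \vdots & & \vdots \\ \sqrt{w_n} & \sqrt{w_n}x_{n1} & \cdots & \sqrt{w_n}x_{nm} \end{pmatrix}
= \begin{pmatrix} \sqrt{w_1} & 0 & \cdots & 0 \\ 0 & \sqrt{w_2} & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & 0 & \cdots & \sqrt{w_n} \end{pmatrix}
\begin{pmatrix} 1 & x_{11} & \cdots & x_{1m} \\ 1 & x_{21} & \cdots & x_{2m} \\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nm} \end{pmatrix}
= \sqrt{W} X$ (6.120)

where $\sqrt{W} = \mathrm{diag}(\sqrt{w_1}, \sqrt{w_2}, \ldots, \sqrt{w_n})$, then $L$ is the likelihood function for the model

$z = V\beta + \varepsilon$ (6.121)

where $\varepsilon \sim N(0, \sigma^2 I_n)$ and $z = (\sqrt{w_1}y_1, \sqrt{w_2}y_2, \ldots, \sqrt{w_n}y_n)^T$. Thus, $\beta$ can be estimated
by least squares as before with the proviso that the model (6.121) does not have an
intercept since the zero-th column of $V$ is $(\sqrt{w_1}, \sqrt{w_2}, \ldots, \sqrt{w_n})^T$. Since the least squares
estimates of $\beta$ are not affected by this, it follows that $\beta$ is estimated by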


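$\hat{\beta}_w = (V^T V)^{-1} V^T z$
$= [(\sqrt{W}X)^T \sqrt{W}X]^{-1} (\sqrt{W}X)^T \sqrt{W}y$
$= (X^T \sqrt{W}\sqrt{W} X)^{-1} X^T \sqrt{W}\sqrt{W} y$
$= (X^T W X)^{-1} X^T W y.$ (6.122)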


Also, the MLE $\hat{\sigma}^2$ of $\sigma^2$ is given by

$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} w_i (y_i - \hat{\mu}_i)^2 = \frac{SSE_w}{n}$ (6.123)

where $\hat{\mu}_i = \hat{\beta}_w^T x_i$, and $SSE_w$ is the weighted sum of squares.
From (6.122)

$E(\hat{\beta}_w) = E[(X^TWX)^{-1} X^TWY] = (X^TWX)^{-1} (X^TWX)\beta = \beta$ (6.124)

so that $\hat{\beta}_w$ is an unbiased estimator of $\beta$. Also, since $SSE_w$ is the residual sum of squares
from fitting (6.121) it follows from Theorem 5.1 that

$E\left(\frac{SSE_w}{\sigma^2}\right) = n - m - 1$ (6.125)

so that

$s_w^2 = \frac{SSE_w}{n - m - 1}$ (6.126)

is an unbiased estimator of $\sigma^2$. Then $\sqrt{s_w^2}$ is taken as the estimate of $\sigma$. Note that (6.122)
is not the OLS estimator $\hat{\beta}$ of $\beta$ in the model $Y = X\beta + \varepsilon$. From the Gauss-Markov
theorem $\hat{\beta}_w$ is the BLUE of $\beta$ in (6.121), hence the OLS estimator of $\beta$ is less
efficient than $\hat{\beta}_w$. The estimate of $\beta$ in (6.121) is usually called the weighted least squares
(WLS) estimator and is generally preferred to $\hat{\beta}_{OLS}$ when $\varepsilon_i \sim N(0, \sigma^2/w_i)$, $1 \le i \le n$.
A number of additional properties of $\hat{\beta}_w$ are given next.


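Theorem 6.4 Let $\hat{\beta}_w$ be the weighted least squares estimator of $\beta$ in (6.122) when
$\Sigma(\varepsilon) = \sigma^2\,\mathrm{diag}(1/w_1, 1/w_2, \ldots, 1/w_n)$. Then,

(i) $\Sigma(\hat{\beta}_w) = \sigma^2 (X^TWX)^{-1}$.

(ii) Let $\delta_i$ be the $i$-th diagonal element of $(X^TWX)^{-1}$. If $\varepsilon_i \sim N(0, \sigma^2/w_i)$, $1 \le i \le n$,
then $(\hat{\beta}_{wi} - \beta_i)/(\sigma\sqrt{\delta_i})$ is $N(0, 1)$ and

$\frac{\hat{\beta}_{wi} - \beta_i}{s_w \sqrt{\delta_i}}$ (6.127)

has a $t$-distribution with $n - m - 1$ degrees of freedom.

(iii) Let $\hat{Y}_w = X\hat{\beta}_w$, then $E(\hat{Y}_w) = X\beta$, so that $\hat{Y}_w$ is an unbiased estimator of $X\beta$
and

$\Sigma(\hat{Y}_w) = \sigma^2 X (X^TWX)^{-1} X^T.$ (6.128)

(iv) Let $\hat{\varepsilon}_w = Y - \hat{Y}_w$ be the residuals from $Y = X\beta + \varepsilon$ and $\hat{\varepsilon}_w^* = Z - \hat{Z} = Z - V\hat{\beta}_w$
be the residuals from the transformed model (6.121). Then,

$\hat{\varepsilon}_w^* = \sqrt{W}\,\hat{\varepsilon}_w$ (6.129)

and $E(\hat{\varepsilon}_w^*) = E(\hat{\varepsilon}_w) = 0$.

(v) Let $H_w = X (X^TWX)^{-1} X^TW$ be the weighted hat matrix. Then

$\hat{\varepsilon}_w = (I - H_w)\varepsilon$ (6.130)

and

$\Sigma(\hat{\varepsilon}_w) = \sigma^2 (I - H_w) W^{-1}.$ (6.131)

Hence,

$\Sigma(\hat{\varepsilon}_w^*) = \sigma^2 W^{1/2} (I - H_w) W^{-1/2}.$ (6.132)

Proof. (i) Since $\hat{\beta}_w = (X^TWX)^{-1} X^TWY$, it follows from Theorem 4.16 that

$\Sigma(\hat{\beta}_w) = (X^TWX)^{-1} X^TW\,\Sigma(Y)\,[(X^TWX)^{-1} X^TW]^T$
$= (X^TWX)^{-1} X^TW\,\sigma^2 W^{-1}\,WX (X^TWX)^{-1}$
$= \sigma^2 (X^TWX)^{-1}.$ (6.133)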


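(ii) This follows because $\mathrm{Var}(\hat{\beta}_{wi}) = \sigma^2\delta_i$ and from Theorem 5.4 $s_w^2$ is independent
of $\hat{\beta}_w$, so that the result follows as in Theorem 5.4.

(iii) Since $\hat{Y}_w = X\hat{\beta}_w$, $E(\hat{Y}_w) = XE(\hat{\beta}_w) = X\beta$. Also

$\Sigma(\hat{Y}_w) = X\,\Sigma(\hat{\beta}_w)\,X^T = \sigma^2 X (X^TWX)^{-1} X^T.$ (6.134)

(iv) From (6.121) $Z = \sqrt{W}Y$ and $V = \sqrt{W}X$, so that

$\hat{\varepsilon}_w^* = Z - \hat{Z} = \sqrt{W}Y - \sqrt{W}X\hat{\beta}_w = \sqrt{W}(Y - X\hat{\beta}_w) = \sqrt{W}\,\hat{\varepsilon}_w.$ (6.135)

Since $E(\hat{\varepsilon}_w) = E(Y - \hat{Y}_w) = X\beta - X\beta = 0$, it follows that $E(\hat{\varepsilon}_w^*) = \sqrt{W}E(\hat{\varepsilon}_w) = 0$.

(v) By definition, $\hat{Y}_w = X\hat{\beta}_w = X (X^TWX)^{-1} X^TWY$ and $Y = X\beta + \varepsilon$ so that

$\hat{\varepsilon}_w = Y - \hat{Y}_w = X\beta + \varepsilon - X (X^TWX)^{-1} X^TW (X\beta + \varepsilon)$
$= X\beta - X\beta - X (X^TWX)^{-1} X^TW\varepsilon + \varepsilon$
$= [I - X (X^TWX)^{-1} X^TW]\,\varepsilon$
$= (I - H_w)\,\varepsilon.$ (6.136)

Using the fact that $H_w^2 = H_w$ (it is a projection matrix), (6.131) and (6.132) follow by
some tedious algebra. We leave the details to the reader. □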


From Theorem 6.4 and the prior results it follows that the WLS estimator $\hat{\beta}_w$ of
$\beta$ and its standard error can be obtained by using the transformed model
(6.121). Moreover, significance tests for $\beta$ can be done using the transformed model as
well. However, one must be careful in performing $F$ tests and computing $R^2$.

Since the transformed model does not have an intercept, the usual decomposition
of the sum of squares does not hold. Hence the $F$ statistic for the transformed model
cannot be defined in the usual way and the issue of $R^2$ is analogous to that considered
for SLR through the origin. However, since the “extra sum of squares” principle is valid
whether or not the model has an intercept, an $F$ test for the overall significance of the
regression can be obtained from the statistic

$F_w = \frac{(SSE_R - SSE_F)/m}{s_w^2}$ (6.137)

where $SSE_F$ is the error sum of squares and $s_w^2$ the residual mean square from the full
model, and $SSE_R$ is the residual sum of squares from the reduced transformed model

$z = \beta_0 v_0 + \varepsilon$ (6.138)

where $v_0 = (\sqrt{w_1}, \sqrt{w_2}, \ldots, \sqrt{w_n})^T$. If the errors are normal, under the null hypothesis

$H_0: \beta_1 = \beta_2 = \cdots = \beta_m = 0$ (6.139)

$F_w$ has an $F$ distribution with $(m, n - m - 1)$ degrees of freedom, so we reject $H_0$
at level $\alpha$ if

$F_w > F_{\alpha;\,m,\,n-m-1}.$ (6.140)

This issue of an appropriate goodness of fit measure is somewhat controversial, but
a natural choice, as for SLR, is to define

$R_w^2 = \hat{\rho}^2$ (6.141)

where $\hat{\rho}$ is the sample correlation coefficient of $\hat{z}$ and $z$. This is consistent with our
approach for SLR and reduces to $R^2$ when $W = I$.

For model checking, it is necessary to consider appropriate residual plots. Since we
have two sets of residuals, $\hat{\varepsilon}_i$ from the original model (5.1) and $\hat{\varepsilon}_i^*$ from the transformed
model, we have two choices. As for the case of constant variance, one can also consider
standardized and studentized residuals obtained by dividing $\hat{\varepsilon}_i^*$ and $\hat{\varepsilon}_i$ by their standard
errors. Using (v) in Theorem 6.4 and (6.120) it can be shown that the standardized,
hence studentized, residuals are the same. However, we must be careful in determining
what values we should choose for these to be orthogonal to the independent variables
$x_i$, $1 \le i \le n$, or predicted values. This will be true if we plot $\hat{\varepsilon}_i^*$ from the transformed
model against the columns of $V$ and the predicted values $\hat{z}_i$. For example, using (6.129)

$\sum_{i=1}^{n} \hat{\varepsilon}_i^* \hat{z}_i = 0.$ (6.142)

Since $\hat{\varepsilon}_i^* = \sqrt{w_i}\,\hat{\varepsilon}_i$ and $\hat{z}_i = \sqrt{w_i}\,\hat{y}_i$, using these in (6.142) gives

$\sum_{i=1}^{n} w_i \hat{\varepsilon}_i \hat{y}_i = 0.$ (6.143)

Thus the residuals $\hat{\varepsilon}_i$ from the original model are not orthogonal to the predicted
values $\hat{y}_i$, hence plotting $\hat{\varepsilon}_i$ against $\hat{y}_i$ will give a confusing plot. However (6.143) shows
that plotting $\sqrt{w_i}\,\hat{\varepsilon}_i$ against $\sqrt{w_i}\,\hat{y}_i$ is appropriate. Similar observations hold for plots of
$\hat{\varepsilon}_i$ against the variables $x_i$, $1 \le i \le n$.

One can also examine the leverage of $y_i$, $1 \le i \le n$, and influence diagnostics using
the transformed model.
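
The equivalence between the direct formula (6.122) and ordinary least squares on the transformed model (6.121), together with the weighted orthogonality relation (6.143), can be checked numerically. The following is only a minimal sketch: the data are simulated, and the weights $w_i = 1/x_i^2$, the true coefficients and all variable names are assumptions made for illustration, not values from the text.

```python
import numpy as np

# Simulated data (an assumption for illustration): Var(eps_i) = sigma^2 / w_i
# with known weights w_i = 1 / x_i^2, so the error s.d. grows with x.
rng = np.random.default_rng(1)
n = 40
x = np.linspace(1.0, 10.0, n)
X = np.column_stack([np.ones(n), x])      # design matrix with intercept column
w = 1.0 / x**2
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.5 * x)

# Direct WLS formula (6.122): beta_w = (X'WX)^{-1} X'Wy
W = np.diag(w)
beta_w = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

# OLS on the transformed model (6.121): z = V beta + eps, V = sqrt(W) X
sqrt_w = np.sqrt(w)
V = sqrt_w[:, None] * X
z = sqrt_w * y
beta_t, *_ = np.linalg.lstsq(V, z, rcond=None)

# Residuals and fitted values on the original scale
fitted = X @ beta_w
resid = y - fitted

# Weighted orthogonality (6.143); the unweighted sum is generally not zero
weighted_cross = np.sum(w * resid * fitted)
unweighted_cross = np.sum(resid * fitted)

# Unbiased variance estimate (6.126) with m + 1 = 2 fitted coefficients
s2_w = np.sum(w * resid**2) / (n - X.shape[1])

print(beta_w, beta_t, weighted_cross, unweighted_cross, s2_w)
```

Both estimates agree to rounding error, which is why standard OLS software applied to $(z, V)$ without an intercept can be used for WLS when the weights are known.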


Weights Unknown 


In general the weights in (6.113) are unknown and need to be estimated from the data
along with $\beta_j$, $0 \le j \le m$, and $\sigma^2$. Unfortunately in this generality this is not possible
since there are more unknowns than observations. However, in many situations one may
have some further knowledge about the error distribution, which enables one to do this.

For example, suppose we have an experiment with a binary response $y = 0$ or $y = 1$,
such as one in which the effect of a prescription drug depends on the dose of the drug.
If dose $x$ is given to $n_x$ patients and we assume that the patients are a random sample
of size $n_x$, then the number of recoveries $r_x$ is an observation of a binomial random
variable $Y_x$ with

$P\{Y_x = r_x\} = \binom{n_x}{r_x} p_x^{r_x} (1 - p_x)^{n_x - r_x}$ (6.144)

where $p_x$ is the probability of a recovery at dose $x$. From (6.144)

$E(Y_x) = n_x p_x$ and $\mathrm{Var}(Y_x) = n_x p_x (1 - p_x).$ (6.145)

Unless $n_x$ and $p_x$ are constant, the variances of $Y_x$ are unequal. If, for simplicity, we
assume that $Y_x$ depends linearly on $x$ (intuitively, one would expect the effect of the
drug to increase as the dose increases), then

$Y_x = \beta_0 + \beta_1 x + \varepsilon_x$ (6.146)

where $\mathrm{Var}(Y_x) = n_x p_x (1 - p_x)$.
To estimate $\beta_0, \beta_1$ we can use WLS and minimize

$\sum_{i=1}^{q} \frac{(y_i - \beta_0 - \beta_1 x_i)^2}{n_{x_i} p_{x_i} (1 - p_{x_i})}$ (6.147)

where $y_i = r_{x_i}$ is the number of recoveries at dose $x_i$, $1 \le i \le q$. However, the $p_{x_i}$ are
unknown, so we seem to have an impasse. In this instance a reasonable estimate of $p_{x_i}$ is
$r_{x_i}/n_{x_i}$, so using this in (6.147) we can estimate $\beta_0, \beta_1$ by minimizing

$\sum_{i=1}^{q} \frac{(y_i - \beta_0 - \beta_1 x_i)^2}{n_{x_i} (r_{x_i}/n_{x_i}) (1 - r_{x_i}/n_{x_i})} = \sum_{i=1}^{q} \frac{(y_i - \beta_0 - \beta_1 x_i)^2}{y_i (1 - y_i/n_{x_i})}$ (6.148)

which is now the SSE for a model with known variances $\sigma_i^2 = y_i (1 - y_i/n_{x_i})$.
This process can be iterated. Suppose $\hat{y}_i^{(1)}$, $1 \le i \le q$, are the predicted values from
the WLS fit above. Then we can estimate the variances by $\hat{y}_i^{(1)} (1 - \hat{y}_i^{(1)}/n_{x_i})$. The
reciprocals of these can then be used as new weights and $\beta_0, \beta_1$ reestimated. Obviously, this
process can be repeated, and if it converges it allows one to estimate the parameters by WLS
even if the weights are unknown. It is interesting to note that if this process converges, then the
limiting values $\hat{\beta}_0, \hat{\beta}_1$ are the MLEs of $\beta_0, \beta_1$. Hence, for this model, MLE is equivalent
to iteratively reweighted least squares [14]. We shall return to this matter in Chapter 7.
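
The iteration just described can be sketched in a few lines. This is an illustrative sketch only: the dose levels, group sizes and recovery counts below are made-up numbers, and the helper `wls` is a hypothetical name. The final check evaluates the score equations of the binomial likelihood with linear mean $\mu_i = \beta_0 + \beta_1 x_i$, which should vanish at the limit if the iteration has converged to the MLE.

```python
import numpy as np

# Hypothetical dose-response data (not from the text): doses x_i, group
# sizes n_{x_i}, and observed numbers of recoveries r_{x_i}.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
n_x = np.full(6, 50.0)
r = np.array([8.0, 14.0, 22.0, 27.0, 36.0, 41.0])
X = np.column_stack([np.ones_like(x), x])

def wls(X, y, w):
    """Weighted least squares estimate (X'WX)^{-1} X'Wy for weights w."""
    XtW = X.T * w
    return np.linalg.solve(XtW @ X, XtW @ y)

# Starting fit: variances estimated from the data as in (6.148)
var = r * (1.0 - r / n_x)
beta = wls(X, r, 1.0 / var)

# Re-estimate the variances from the fitted values and refit until stable
for _ in range(100):
    mu = X @ beta
    var = mu * (1.0 - mu / n_x)
    new_beta = wls(X, r, 1.0 / var)
    converged = np.max(np.abs(new_beta - beta)) < 1e-12
    beta = new_beta
    if converged:
        break

# Score of the binomial log-likelihood in (beta_0, beta_1) at the limit
mu = X @ beta
score = X.T @ ((r - mu) / (mu * (1.0 - mu / n_x)))
print(beta, score)
```

The sketch assumes the fitted values stay strictly between 0 and $n_{x_i}$; if they do not, the estimated variances become non-positive and the scheme breaks down, which is one practical limitation of a linear model for binomial counts.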


Example 6.10 The data in Table 6.14 were collected for the purpose of determining
if the probability of owning a home is related to income. In column 1 are the incomes
(in dollars) of 20 people ($x$) and column 2 lists homeowner status, 1 for an owner and
0 if not ($y$). We consider whether there is a linear relation between $y$ and $x$. Hence we
propose a model of the form

$Y_x = \beta_0 + \beta_1 x + \varepsilon_x$ (6.149)

where $Y_x$ is a Bernoulli random variable and $E(\varepsilon_x) = 0$ so that

$E(Y_x) = \beta_0 + \beta_1 x = P\{Y_x = 1\}.$ (6.150)

Hence $\mathrm{Var}(Y_x) = (\beta_0 + \beta_1 x)(1 - \beta_0 - \beta_1 x)$ and $Y_x$ does not have constant variance.
Hence, the Gauss-Markov theorem cannot be used to obtain optimal linear estimates of
$(\beta_0, \beta_1)$. Of course, if we knew the weights in (6.113) then we could use weighted least
squares to obtain these estimates. Since these are unknown, we proceed in an iterative
fashion as indicated above. We first fit (6.149) by OLS and then we estimate the weights
by

$w_i = [\hat{y}_i (1 - \hat{y}_i)]^{-1}$ (6.151)

where $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$, $1 \le i \le 20$, and $\hat{\beta}_0$ and $\hat{\beta}_1$ are the least squares estimates of
$\beta_0$ and $\beta_1$. This was done and the fitted values and residuals are shown in the first two
columns of Table 6.17.


Table 6.14 Homeowner data 


y 
0 
1 
0 
1 
0 
0 
1 
0 
1 
1 


FP о Фон н н о н о н Ie 


Table 6.15 ANOVA table for Homeowner Data 


Source df Sum of Squares Mean Squares F p-value 
Regression 1 1.6603 1.6603 9.08 0.007 
Residual 18 3.2897 0.1828 
Total 19 4.9500 


$R^2 = 0.3350$    $R_A^2 = 0.2980$


Table 6.16 t statistics for Homeowner data 
Predictor   Coefficient   S.E. Coeff.   t-statistic   p-value
constant    -0.2501       0.2821        -0.89         0.387
x           0.015163      0.041713      3.01          0.007


From Table 6.15 it appears that the overall regression is significant with $\beta_1$ appearing
to be significantly different from zero. However, it is not clear that these statistics can
be given their usual interpretation since the errors in (6.149) are not normal. To get
further confirmation of this we made residual plots of the ordinary residuals against $\hat{y}_i$,
a histogram of residuals, a normal plot and an I Chart. Examining Figure 6.19, we
see residuals that look roughly normal, but the residual plot is rather different. One
can clearly see the effect of the nonconstant variance of the residuals. The residuals
systematically decrease in parallel. To emphasize this, we show a plot of the absolute
values of $\hat{\varepsilon}_i$ against $\hat{y}_i$ in Figure 6.20. Such intersecting (or X-shaped) plots are
characteristic of errors associated with binary responses. Clearly, inferences about the
estimated model based on normal statistics should be treated cautiously.

Figure 6.19: Plots of ordinary residuals for homeowners data

Figure 6.20: Plot of ordinary residuals $|\hat{\varepsilon}_i|$ versus $\hat{y}_i$


Table 6.17 Residuals and Weighted Residuals 


Obs. No.   OLS Fit   OLS Res   Weight    WLS Fit    WLS Res
1          0.1785    -0.1785   6.8197    0.141879   -0.141879
2          0.8446    0.1554    7.6179    0.797688   0.202312
3          0.2198    -0.2198   5.8313    0.182549   -0.182549
4          0.4418    0.5582    4.0549    0.401152   0.598848
5          0.6639    -0.6639   4.4812    0.619755   -0.619755
6          0.9375    -0.9375   17.0699   0.889196   -0.889196
7          0.3437    0.6563    4.4331    0.304560   0.695440
8          0.3076    -0.3076   4.6954    0.268973   -0.268973
9          0.5451    0.4549    4.0328    0.502828   0.497172
10         0.9065    0.0935    11.8020   0.858693   0.141307
11         0.7155    0.2845    4.9124    0.670593   0.329407
12         0.2714    -0.2714   5.0567    0.233387   -0.233387
13         0.7568    0.2432    5.4331    0.711263   0.288737
14         0.1630    -0.1630   7.3296    0.126627   -0.126627
15         0.3695    0.6305    4.2922    0.329979   0.670021
16         0.9891    0.0109    93.1473   0.940034   0.059966
17         0.8704    0.1296    8.8643    0.823107   0.176893
18         0.2353    -0.2353   5.5577    0.197800   -0.197800
19         0.3127    -0.3127   4.6526    0.274057   -0.274057
20         0.9272    0.0728    14.8121   0.879028   0.120972


To see if one gets improvement using WLS the model (6.149) was refit using the 
weights in column 4 of Table 6.17. A direct comparison could not be made between OLS 
and WLS so we computed the test statistics as if this was an OLS fit. The ANOVA 
table, t statistics and residual plots are shown in Tables 6.18-6.19 and Figures 6.21-6.22. 


Table 6.18 ANOVA table for Homeowner data using WLS 


Source df Sum of Squares Mean Squares F p-value 
Regression 1 18.946 18.946 13.25 0.002 
Residual 18 25.733 1.430 
Total 19 44.680 

$R^2 = 0.424$    $R_A^2 = 0.392$


Table 6.19 t statistics for Homeowner data using WLS 


Predictor   Coefficient   S.E. Coeff.   t-statistic   p-value
constant    -0.2801       0.2878        -0.97         0.343
x           0.045084      0.041396      3.64          0.002


From Tables 6.18-6.19 we see that the WLS fit is quite similar to the OLS results,
although the standard errors of the coefficients are a little smaller. However, the plot of
$|\hat{\varepsilon}_i|$ against $\hat{y}_i$ still has a characteristic X shape. Again the nonconstant variance and
non-normality of the errors make inferences problematic. We shall return to this matter
in Chapter 7.

Figure 6.21: Plots of weighted residuals for homeowners data

Figure 6.22: Plot of weighted $|\hat{\varepsilon}_i|$ versus $\hat{y}_i$


6.5.8 Variance Stabilizing Transformations


Another approach to equalizing variances is to try to transform the observations $y_i$ so that
the transformed values $z_i = g(y_i)$ have approximately equal variance. As a particular
case, the Box-Cox method does this, but more general transformations than power
transformations may be needed.

To simplify matters, assume that the variance of $Y_i$ is a function of $\mu_i = E(Y_i)$. For
the binomial distribution

$\mathrm{Var}(Y_i) = n_i p_i (1 - p_i).$ (6.152)

However, $E(Y_i) = n_i p_i = \mu_i$ so that $p_i = E(Y_i)/n_i$ and

$\mathrm{Var}(Y_i) = n_i (\mu_i/n_i)(1 - \mu_i/n_i) = (n_i - \mu_i)(\mu_i/n_i) = f(\mu_i).$ (6.153)

With this assumption we will try to find $g$ such that $\mathrm{Var}[g(Y_i)]$ does not depend on $i$.
Assuming $Y_i$ does not vary considerably from $\mu_i$ we use Taylor's series to get

$g(Y_i) \approx g(\mu_i) + (Y_i - \mu_i)\,g'(\mu_i).$ (6.154)

Hence,

$\mathrm{Var}[g(Y_i)] \approx [g'(\mu_i)]^2\,\mathrm{Var}(Y_i).$ (6.155)

If $\mathrm{Var}(Y_i) = f(\mu_i)$, then

$\mathrm{Var}[g(Y_i)] \approx [g'(\mu_i)]^2 f(\mu_i).$ (6.156)

For this to be independent of $i$, we must have

$[g'(\mu_i)]^2 f(\mu_i) = \sigma^2$ (6.157)

or

$g'(\mu_i) = \sigma/\sqrt{f(\mu_i)}.$ (6.158)

Letting $\mu = \mu_i$, $g$ can be found by integrating (6.158) giving

$g(\mu_i) = \sigma \int^{\mu_i} \frac{dx}{\sqrt{f(x)}}.$ (6.159)


Example 6.11 If $Y_i$ is a binomial random variable, then $f(\mu) = \mu(n - \mu)/n$, and
$g$ is given by

$g(\mu) = \sigma \int^{\mu} \frac{\sqrt{n}\,dx}{\sqrt{x(n - x)}}.$ (6.160)

Letting $ny = x$, (6.160) becomes

$g(\mu) = \sigma \int^{\mu/n} \frac{\sqrt{n}\,n\,dy}{\sqrt{ny(n - ny)}} = \sigma\sqrt{n} \int^{\mu/n} \frac{dy}{\sqrt{y(1 - y)}} = 2\sigma\sqrt{n}\,\sin^{-1}\left(\sqrt{\mu/n}\right).$ (6.161)

Taking $\sigma = 1$, for proportion data this gives

$z_i = g(Y_i) = 2\sqrt{n_i}\,\sin^{-1}\left(\sqrt{Y_i/n_i}\right)$ (6.162)

as an appropriate transformation to equalize variance. Transforming the data this way
indicates that an ordinary least squares estimation procedure can be used to estimate
$\beta_0, \beta_1$.
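
The effect of the transformation (6.162) can be illustrated by simulation. In the sketch below the group size $n = 200$, the three success probabilities and the number of replications are arbitrary choices made for illustration; by (6.156) with $\sigma = 1$, the variance of the transformed counts should be close to one in every group, while the raw variances $np(1 - p)$ differ.

```python
import numpy as np

# Simulation settings (assumptions for illustration only)
rng = np.random.default_rng(0)
n = 200
probs = [0.2, 0.5, 0.8]

raw_var = []
stab_var = []
for p in probs:
    y = rng.binomial(n, p, size=20000)
    # Arcsine transformation (6.162) with sigma = 1
    z = 2.0 * np.sqrt(n) * np.arcsin(np.sqrt(y / n))
    raw_var.append(y.var())
    stab_var.append(z.var())

print("raw variances:        ", raw_var)   # roughly n p (1 - p), unequal
print("transformed variances:", stab_var)  # roughly 1 for every p
```

The agreement is only approximate because (6.154) is a first-order Taylor expansion; it deteriorates when $n$ is small or $p$ is close to 0 or 1.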


Example 6.12 As we observed previously, it was suggested in OzDASL (see Ex. 6.1)
that a gamma error distribution might be a better choice than a normal error model
for the drink delivery data. Also, our discussion of the Box-Cox method suggested that
a transformation is appropriate. Here we examine the possibility of using a variance
stabilizing transformation. Since an exponential random variable is the simplest gamma
random variable, we examine this possibility.

Hence, we assume that $Y_i$ has a density

$f(y_i) = \lambda_i \exp(-\lambda_i y_i).$ (6.163)

In this case

$E(Y_i) = \int_0^{\infty} \lambda_i y\,e^{-\lambda_i y}\,dy.$ (6.164)

Letting $z = \lambda_i y$

$E(Y_i) = \frac{1}{\lambda_i} \int_0^{\infty} z e^{-z}\,dz = \frac{1}{\lambda_i} = \mu_i.$ (6.165)

Similarly,

$E(Y_i^2) = 2/\lambda_i^2.$ (6.166)

Hence,

$\mathrm{Var}(Y_i) = E(Y_i^2) - [E(Y_i)]^2 = 1/\lambda_i^2 = \mu_i^2.$ (6.167)

Using this in (6.156) we choose $g$ by

$[g'(\mu_i)]^2 \mu_i^2 = \sigma^2$ (6.168)

or

$g'(\mu_i)\,\mu_i = \sigma.$ (6.169)

Integrating gives

$g(\mu_i) = \int^{\mu_i} (\sigma/\mu)\,d\mu = \sigma \log \mu_i.$ (6.170)

Hence, this suggests transforming the drink delivery data by

$\log y = \beta_0 + \beta_1 x_1 + \varepsilon.$ (6.171)

One can apply this model to fit the data and compare the outputs with the results
from the untransformed data. We leave the details to the reader.




6.6 Correlated Errors 


As we have observed in a number of places residual plots can sometimes be used to detect 
correlation between the errors in the GLM. Typically these will show up as systematic 
oscillations in plots of residuals against fitted values. Most commonly, this behavior 
occurs for data taken over time, such as population or economic data. 

If the correlation coefficient $\rho$ is positive, then the oscillations are slow as shown for the
Clark County population data and the Longley data. When the correlation is negative,
then the oscillations tend to have much shorter periods since successive residuals tend
to have opposite signs. When the residuals are correlated, the independence assumptions
of the GLM fail so it is important to identify this and remedy it if possible.
This topic is a separate branch of statistics usually called time series analysis and is
generally beyond the scope of this text. In recent years models with spatial correlation
have become increasingly important in many applied areas. This topic, usually called
kriging, has many features in common with regression analysis, but again the details are
beyond the scope of the text.

If autocorrelation is present and the errors have constant variance then it follows that: 


(i) the OLS estimator of $\beta$ in (5.11) is unbiased, but the Gauss-Markov theorem does
not hold so the least squares estimates no longer have minimum variance;


(ii) $MSE = SSE/(n - m - 1)$, the estimate of $\sigma^2$, may be substantially smaller than
the true value of $\sigma^2$ and give a false impression of accuracy;


(iii) as a consequence of (ii) t values may be inflated and inferences and confidence 
intervals for the parameters may give a false impression of precision; 


(iv) since the errors are dependent, F and t tests are not strictly valid even if the errors 
are normal. 


6.6.1 The Durbin-Watson Statistic 


Detecting and correcting for autocorrelation is a rather complex subject so we will limit
our discussion to the case where the observations are taken in time $t = 1, 2, \ldots, n$ and the
errors $\varepsilon_t$, $1 \le t \le n$, satisfy the first order autoregressive condition

$\varepsilon_t = \rho\,\varepsilon_{t-1} + u_t, \quad |\rho| < 1$ (6.172)

where $\rho$ is the autocorrelation coefficient and $u_t$, $t \ge 1$, are independent normal random
variables with constant variance $\sigma^2$ and $u_t$ is independent of $\varepsilon_{t-1}$, $t > 1$. Generally, for time
series data $\rho > 0$ (positive autocorrelation). To test the hypothesis $H_0: \rho = 0$ against
$H_1: \rho > 0$ one usually uses the Durbin-Watson statistic

$d = \frac{\sum_{t=2}^{n} (\hat{\varepsilon}_t - \hat{\varepsilon}_{t-1})^2}{\sum_{t=1}^{n} \hat{\varepsilon}_t^2}$ (6.173)

where $\hat{\varepsilon}_t$, $1 \le t \le n$, are the residuals estimated from the least squares fit to (5.1). If one
concludes that $\rho > 0$, then $\rho$ is estimated by


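$\hat{\rho} = \frac{\sum_{t=2}^{n} \hat{\varepsilon}_t \hat{\varepsilon}_{t-1}}{\sum_{t=1}^{n} \hat{\varepsilon}_t^2}.$ (6.174)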
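From (6.173) and (6.174) one can show that $d \approx 2(1 - \hat{\rho})$. In fact, the numerator in
$d$ is

$\sum_{t=2}^{n} (\hat{\varepsilon}_t - \hat{\varepsilon}_{t-1})^2 = \sum_{t=2}^{n} \hat{\varepsilon}_t^2 + \sum_{t=2}^{n} \hat{\varepsilon}_{t-1}^2 - 2\sum_{t=2}^{n} \hat{\varepsilon}_t \hat{\varepsilon}_{t-1}$
$\approx 2\sum_{t=2}^{n} \hat{\varepsilon}_t^2 - 2\sum_{t=2}^{n} \hat{\varepsilon}_t \hat{\varepsilon}_{t-1}$
$= 2\left(\sum_{t=2}^{n} \hat{\varepsilon}_t^2 - \sum_{t=2}^{n} \hat{\varepsilon}_t \hat{\varepsilon}_{t-1}\right),$ (6.175)

where the second line uses the approximation $\sum_{t=2}^{n} \hat{\varepsilon}_{t-1}^2 \approx \sum_{t=2}^{n} \hat{\varepsilon}_t^2$.
Again, using the approximation $\sum_{t=1}^{n} \hat{\varepsilon}_t^2 \approx \sum_{t=2}^{n} \hat{\varepsilon}_t^2$ in (6.175) it follows from (6.174)
and (6.175) that

$d \approx 2(1 - \hat{\rho}).$ (6.176)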


Using the Cauchy-Schwarz inequality (see Exercise 6.8), $\hat{\rho}$ lies approximately
between $\pm 1$ so that $d$ lies approximately between 0 and 4. Hence, values of $\hat{\rho}$ near one
give $d$ near zero while if $\hat{\rho} \approx 0$, $d \approx 2$. Hence, small values of $d$ lead us to accept
$H_1: \rho > 0$ while $d \approx 2$ leads us to accept $H_0: \rho = 0$. The critical values for the test were
given by Durbin and Watson in [29, 30, 31] and some values are reproduced in Table A.5.

The formal test procedure is given as follows: 


(i) calculate $d$ in (6.173) using residuals from the least squares fit of (5.1);
(ii) obtain critical values $(d_L, d_U)$ for the appropriate sample size $n$ and number of
variables $m + 1$ in Table A.5. Then,
(a) reject $H_0$ if $d < d_L$;
(b) do not reject $H_0$ if $d > d_U$;
(c) if $d_L < d < d_U$, the test is inconclusive.


To test for negative $\rho$, apply the Durbin-Watson test to $d' = 4 - d$.
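
The statistic (6.173) and the quick estimate $\hat{\rho} = 1 - d/2$ implied by (6.176) are simple to compute from a vector of residuals. The sketch below uses two small made-up residual sequences (a slowly varying one and a sign-alternating one) to show the two extremes; the function names are hypothetical. The last two lines reuse the sums quoted for the Longley fit in Example 6.13.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic (6.173) for residuals ordered in time."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

def quick_rho(d):
    """Quick estimate of rho from d, using d ~ 2(1 - rho) as in (6.176)."""
    return 1.0 - d / 2.0

# Made-up residual sequences for illustration
smooth = [1, 2, 3, 2, 1, -1, -2, -3, -2, -1]   # slow oscillation
alternating = [1, -1] * 5                      # successive sign changes

d_smooth = durbin_watson(smooth)      # well below 2: positive autocorrelation
d_alt = durbin_watson(alternating)    # well above 2: negative autocorrelation

# Sums quoted in Example 6.13 for the Longley data
d_longley = 3493839 / 3579065
rho_longley = quick_rho(d_longley)

print(d_smooth, d_alt, d_longley, rho_longley)
```

Deciding whether a given $d$ is significant still requires the tabulated bounds $(d_L, d_U)$; the code only computes the statistic.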


6.6.2 Correcting for Autocorrelation 


If one concludes that the errors are autocorrelated then there are a number of procedures 
which can be used to correct for this possibility. Here we discuss two methods, the 
Cochrane-Orcutt method [17, 87] and one due to Hildreth and Lu [72]. 

In the Cochrane-Orcutt procedure the model is transformed using (6.172) to produce
a model with uncorrelated errors. For simplicity, we illustrate the approach for the simple
regression model ($t$ = "time")

$Y_t = \beta_0 + \beta_1 x_t + \varepsilon_t$ (6.177)

where $\varepsilon_t$ satisfies (6.172).
Now

$Y_t - \rho Y_{t-1} = \beta_0 + \beta_1 x_t + \varepsilon_t - \rho(\beta_0 + \beta_1 x_{t-1} + \varepsilon_{t-1})$
$= \beta_0 (1 - \rho) + \beta_1 (x_t - \rho x_{t-1}) + \varepsilon_t - \rho\varepsilon_{t-1}$
$= \beta_0 (1 - \rho) + \beta_1 (x_t - \rho x_{t-1}) + u_t$ (6.178)

where by assumption, $u_t$ are independent $N(0, \sigma^2)$ random variables.
If we let $x_t^* = x_t - \rho x_{t-1}$, $Y_t^* = Y_t - \rho Y_{t-1}$, $\beta_0^* = \beta_0 (1 - \rho)$, $\beta_1^* = \beta_1$, then (6.178)
takes the form

$Y_t^* = \beta_0^* + \beta_1^* x_t^* + u_t, \quad 2 \le t \le n.$ (6.179)


Now (6.179) is in the form of the GLM and the parameters (50, 81) can be estimated 
by least squares. However, since we need to know p to do this, generally one has to 
proceed in an iterative fashion to obtain the required estimates. А procedure for doing 
this follows: 


(i) fit (6.177) using OLS and obtain the residuals $\hat{\varepsilon}_t$, $1 \le t \le n$;


(ii) estimate $\rho$ from (6.174) or, if your program produces the Durbin-Watson statistic as
output, then using (6.176) a quick estimate of $\rho$ is given by

$\hat{\rho} = 1 - d/2;$ (6.180)


(iii) construct the variables $y_t - \hat{\rho} y_{t-1}$ and $x_t - \hat{\rho} x_{t-1}$;


(iv) regress $y_t - \hat{\rho} y_{t-1}$ on $x_t - \hat{\rho} x_{t-1}$ to estimate $\beta_0^*, \beta_1^*$. Then let $\hat{\beta}_1 = \hat{\beta}_1^*$ and
$\hat{\beta}_0 = \hat{\beta}_0^*/(1 - \hat{\rho})$ as the estimates in (6.177);


(v) if the residuals from fitting the transformed model show no autocorrelation then 
stop. Otherwise repeat the procedure starting with the estimated transformed 
data; 


(vi) iterate to convergence. 


It is often recommended that only the first step be used [27]. 
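
A single pass through steps (i)-(iv) can be sketched as follows. The data are simulated from (6.177) with AR(1) errors; the true values $\beta_0 = 2$, $\beta_1 = 0.5$, $\rho = 0.7$, the sample size and the helper names are all assumptions made for illustration. The quick estimate (6.180) is used for $\hat{\rho}$.

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares coefficients."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

def durbin_watson(e):
    """Durbin-Watson statistic (6.173)."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# Simulated data (assumed values): Y_t = 2 + 0.5 x_t + eps_t, AR(1) errors
rng = np.random.default_rng(0)
n, rho_true = 200, 0.7
x = rng.uniform(0.0, 10.0, n)
u = rng.normal(0.0, 1.0, n)
eps = np.zeros(n)
eps[0] = u[0]
for t in range(1, n):
    eps[t] = rho_true * eps[t - 1] + u[t]
y = 2.0 + 0.5 * x + eps

# (i) OLS fit of (6.177) and its residuals
X = np.column_stack([np.ones(n), x])
resid = y - X @ ols(X, y)
d_before = durbin_watson(resid)

# (ii) quick estimate of rho from (6.180)
rho_hat = 1.0 - d_before / 2.0

# (iii) transformed variables, available for t = 2, ..., n
y_star = y[1:] - rho_hat * y[:-1]
x_star = x[1:] - rho_hat * x[:-1]

# (iv) regress y* on x* and map back to the parameters of (6.177)
X_star = np.column_stack([np.ones(n - 1), x_star])
b_star = ols(X_star, y_star)
beta1_hat = b_star[1]
beta0_hat = b_star[0] / (1.0 - rho_hat)

# (v) check the residuals of the transformed model for autocorrelation
d_after = durbin_watson(y_star - X_star @ b_star)

print(rho_hat, beta0_hat, beta1_hat, d_before, d_after)
```

In this simulation the Durbin-Watson statistic of the transformed fit moves toward 2, which is the stopping signal in step (v). Note that one observation is lost in forming the differences, so the transformed regression uses $n - 1$ cases.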


Example 6.13 To demonstrate the use of the Durbin-Watson statistic, we refer to
the Longley data fit with the two best predictors $x_2$ and $x_3$. The $R^2$ for this model is
0.981, and all three regression coefficients are highly significant. On the surface it would
seem that the analysis looks complete and that a good-fitting model has been found.
However, as we see in Figure 6.23, a plot of the residuals against the fitted values
strongly suggests the presence of autocorrelation in the residuals. In addition to graphical
displays, in order to use the Durbin-Watson procedure to detect the autocorrelated errors,
the computations for $d$ and $\hat{\rho}$ are displayed in Table 6.20. Recall that we assume the
residuals follow a first-order autoregressive model.

From Table 6.20 and using (6.173) we obtain the value of the Durbin-Watson statistic

$$d = \frac{3493839}{3579065} = 0.976$$


and using (6.174) the estimate of the autocorrelation parameter $\rho$ is

$$\hat{\rho} = \frac{1557272}{3155999} = 0.493.$$


From Table A.4 in the Appendix, for $\alpha = 5\%$, $n = 16$, and $m = 2$, we observe $d_L = 0.98$.
Since $d < d_L$ we reject the hypothesis $H_0: \rho = 0$ and conclude that $\rho > 0$. If $d$ is
significant, we go through the Cochrane-Orcutt procedure using the transformed variables
to remove the autocorrelation. We leave this to the reader as an exercise.


Figure 6.23: Plot of residuals versus fitted values


Table 6.20 Computing the Durbin-Watson Statistic 


No.   e_t       e_{t-1}    e_t - e_{t-1}   (e_t - e_{t-1})^2   e_t^2      e_{t-1}^2   e_t e_{t-1}
1     355.92    -          -               -                   126680     -           -
2     186.88    355.92     -169.048        28576               34924      126680      66514
3     25.43     186.88     -161.453        26067               646        34924       4752
4     -142.97   25.43      -168.395        28357               20440      646         -3635
5     -468.73   -142.97    -325.757        106118              219704     20440       67013
6     -823.54   -468.73    -354.811        125891              678213     219704      386013
7     -202.97   -823.54    620.566         385102              41197      678213      167154
8     -416.53   -202.97    -213.564        45610               173501     41197       84544
9     175.02    -416.53    591.551         349932              30631      173501      -72900
10    1146.89   175.02     971.876         944542              1315360    30631       200724
11    628.24    1146.89    -518.648        268996              394690     1315360     720527
12    -146.46   628.24     -774.705        600168              21451      394690      -92014
13    79.80     -146.46    226.266         51196               6369       21451       -11688
14    300.04    79.80      220.233         48502               90022      6369        23944
15    -46.58    300.04     -346.622        120147              2170       90022       -13977
16    -650.44   -46.58     -603.851        364635              423066     2170        30300
Sum                                        3493839             3579065    3155999     1557272

(Here e_t denotes the residual $\hat{e}_t$ from the OLS fit.)
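
As a check on the arithmetic in Table 6.20, the statistic can be recomputed directly from the residual column. The sketch below is illustrative only: it uses the residuals as printed (rounded to two decimals, with the third and thirteenth entries read as 25.43 and 79.80, which is what the squared-residual column implies), so the result agrees with the tabulated sums only up to rounding.

```python
import numpy as np

# Residuals e_t from Table 6.20, as printed (two decimals).
e = np.array([355.92, 186.88, 25.43, -142.97, -468.73, -823.54, -202.97,
              -416.53, 175.02, 1146.89, 628.24, -146.46, 79.80, 300.04,
              -46.58, -650.44])

diff = np.diff(e)                        # e_t - e_{t-1}, t = 2, ..., n
d = np.sum(diff ** 2) / np.sum(e ** 2)   # Durbin-Watson statistic
rho_quick = 1.0 - d / 2.0                # quick estimate of rho from (6.180)
print(round(d, 3), round(rho_quick, 3))
```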


The Hildreth-Lu procedure is a modification of the Cochrane-Orcutt procedure in
that it seeks to minimize the SSE of the transformed model (6.121) simultaneously with
$\rho$. That is, we find $(\beta_0, \beta_1, \rho)$ to minimize

$$\sum_{t=2}^{n} \left(y_t^* - \beta_0^* - \beta_1^* x_t^*\right)^2, \qquad (6.181)$$

where the starred quantities depend on $\rho$ as in (6.179).

Generally, this requires more work than the Cochrane-Orcutt method since minimizing
(6.181) is a nonlinear optimization problem. On the other hand, the Cochrane-Orcutt
procedure can be carried out using a straightforward modification of a standard OLS
program.
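
One simple way to approach the minimization of (6.181) is a grid search over $\rho$: for each fixed $\rho$ the problem is an ordinary least squares fit, so one can scan a grid and keep the value with the smallest SSE. The sketch below assumes this grid-search strategy; the function names and the grid are hypothetical choices, not part of the original procedure.

```python
import numpy as np

def transformed_sse(rho, x, y):
    """SSE and coefficients of the OLS fit to the rho-transformed data."""
    ys = y[1:] - rho * y[:-1]
    xs = x[1:] - rho * x[:-1]
    X = np.column_stack([np.ones_like(xs), xs])
    coef, *_ = np.linalg.lstsq(X, ys, rcond=None)
    resid = ys - X @ coef
    return float(resid @ resid), coef

def hildreth_lu(x, y, grid=np.linspace(-0.95, 0.95, 191)):
    """Grid search over rho; returns (b0, b1, rho, sse) at the minimizing rho."""
    best = None
    for rho in grid:
        sse, coef = transformed_sse(rho, x, y)
        if best is None or sse < best[0]:
            best = (sse, rho, coef)
    sse, rho, (b0s, b1s) = best
    # Map the transformed intercept back to the original parameterization.
    return b0s / (1.0 - rho), b1s, rho, sse
```

A finer grid around the best value, or a one-dimensional optimizer, can refine the estimate of $\rho$ once the coarse scan has located the minimum.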

Last, we note that autocorrelation may appear for reasons other than some inherent 
property of the data. 


6.7 Generalized Least Squares 


As a last topic in this chapter we consider the problem of least squares estimation when
the errors $\varepsilon$ in (5.1) have an arbitrary variance-covariance matrix $W$. If $W$ is known,
then least squares estimation can be done, as for weighted regression, by converting the
model to an equivalent one with uncorrelated errors with constant variance.

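Assume now that

$$Y_i = \beta_0 + \sum_{j=1}^{m} x_{ij}\beta_j + \varepsilon_i, \quad 1 \le i \le n \qquad (6.182)$$

where $\varepsilon = (\varepsilon_1, \varepsilon_2, \ldots, \varepsilon_n)^T$ has a joint multivariate normal distribution with $E(\varepsilon) = 0$
and $V(\varepsilon) = W$. Then, the joint likelihood density function is given by

$$f_Y(y) = (2\pi)^{-n/2} |W|^{-1/2} \exp\left[-\left(y - \mu, W^{-1}(y - \mu)\right)/2\right] \qquad (6.183)$$

where $\mu = X\beta$.
Thus the MLE of $\beta$ is found by minimizing the positive definite quadratic form

$$Q = \left(y - \mu, W^{-1}(y - \mu)\right). \qquad (6.184)$$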


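Since $W$ is symmetric and positive definite it can be factored as

$$W = RR^T \qquad (6.185)$$

and

$$W^{-1} = (RR^T)^{-1} = (R^T)^{-1}R^{-1} = (R^{-1})^T R^{-1} \qquad (6.186)$$

where $R$ is nonsingular (this can be done, for example, as in Theorem 4.12 or using
Cholesky factorization). Then,

$$\begin{aligned}
Q &= \left(y - \mu, (R^{-1})^T R^{-1}(y - \mu)\right) \\
&= \left(R^{-1}(y - \mu), R^{-1}(y - \mu)\right) \\
&= \left(R^{-1}y - R^{-1}X\beta, R^{-1}y - R^{-1}X\beta\right) \\
&= \left(z - X_R\beta, z - X_R\beta\right) \qquad (6.187)
\end{aligned}$$

where $z = R^{-1}y$ and $X_R = R^{-1}X$.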
Now $Q$ is in the form occurring in (5.83) with $z$ replacing $y$ and $X_R$ replacing $X$.
Hence it follows from Theorem 5.1 that $Q$ is minimized by choosing

$$\begin{aligned}
\hat{\beta}_W &= (X_R^T X_R)^{-1} X_R^T z = \left[(R^{-1}X)^T R^{-1}X\right]^{-1} (R^{-1}X)^T R^{-1}y \\
&= \left[X^T (R^{-1})^T R^{-1} X\right]^{-1} X^T (R^{-1})^T R^{-1} y \\
&= \left[X^T (RR^T)^{-1} X\right]^{-1} X^T (RR^T)^{-1} y \\
&= (X^T W^{-1} X)^{-1} X^T W^{-1} y. \qquad (6.188)
\end{aligned}$$

$\hat{\beta}_W$ is called the generalized least squares estimator of $\beta$.

When the errors are not normal, then this argument suggests that we estimate $\beta$
by $\hat{\beta}_W$ in this case as well. In fact, it can be shown, generalizing the Gauss-Markov
theorem, that $\hat{\beta}_W$ is the BLUE estimator of $\beta$ in (6.182). When $W = \sigma^2 I_n$, $\hat{\beta}_W$ reduces
to $\hat{\beta}_{OLS}$ and when $W = \sigma^2\,\mathrm{diag}(1/w_1, 1/w_2, \ldots, 1/w_n)$ it is the weighted least squares
estimator $\hat{\beta}_w$.

For further discussion it is useful to consider $\hat{\beta}_W$ as the OLS estimator for the
transformed model

$$Z = X_R\beta + \delta \qquad (6.189)$$

where $Z = R^{-1}Y$ and $\delta = R^{-1}\varepsilon$. Using (6.189)

$$\begin{aligned}
V(\delta) &= R^{-1} V(\varepsilon) (R^{-1})^T = R^{-1} W (R^{-1})^T \\
&= R^{-1} R R^T (R^T)^{-1} = I_n \qquad (6.190)
\end{aligned}$$

so that the errors in (6.189) are uncorrelated with $\mathrm{Var}(\delta_i) = 1$, $1 \le i \le n$. Using this
we can easily establish the BLUE property of $\hat{\beta}_W$ and derive a number of other useful
properties of $\hat{\beta}_W$. We summarize these in Theorem 6.5.


Theorem 6.5 Consider the GLM (6.182) where $V(\varepsilon) = W$ is nonsingular. Then,


(i) If the errors are $N(0, W)$ then the generalized least squares estimator $\hat{\beta}_W$ is the
best estimator of $\beta$.


(ii) If the errors have an arbitrary joint distribution with $E(\varepsilon) = 0$, then $\hat{\beta}_W$ is the
BLUE estimator of $\beta$.


(iii) If $V(\varepsilon) = \sigma^2 I_n$ then $\hat{\beta}_W = \hat{\beta}_{OLS}$.


(iv) The variance-covariance matrix of $\hat{\beta}_W$, $\Sigma(\hat{\beta}_W)$, is given by

$$\Sigma(\hat{\beta}_W) = (X^T W^{-1} X)^{-1}. \qquad (6.191)$$


Proof. (i) This can be established using the fact that $\hat{\beta}_{OLS}$ is the standard GLM
estimator of $\beta$ and the relation of (6.182) to the transformed model. For details see
[27, 85].

(ii) Again this can be proven using the transformed model. Details are left to the
reader.

(iii) This has been established above.

For (iv) we have

$$\begin{aligned}
\Sigma(\hat{\beta}_W) &= (X^T W^{-1} X)^{-1} X^T W^{-1}\, \Sigma(Y)\, W^{-1} X (X^T W^{-1} X)^{-1} \\
&= (X^T W^{-1} X)^{-1} X^T W^{-1} W W^{-1} X (X^T W^{-1} X)^{-1} \\
&= (X^T W^{-1} X)^{-1} (X^T W^{-1} X)(X^T W^{-1} X)^{-1} \\
&= (X^T W^{-1} X)^{-1}. \qquad (6.192)
\end{aligned}$$

Since $\hat{\beta}_W$ is unbiased, $\hat{\beta}_W$ is $N\left(\beta, (X^T W^{-1} X)^{-1}\right)$ if $\varepsilon$ is $N(0, W)$. ∎


For purposes of calculation and testing it is usually more convenient to use the
transformation $Z = R^{-1}Y$, $X_R = R^{-1}X$ in going from (6.189) to (6.190) rather than
computing $\hat{\beta}_W$ directly. Doing this we arrive at the following algorithm for carrying out the
least squares analysis of the model in (6.182).


(i) Factor $W$ as $RR^T$. (Say, using Cholesky factorization.)

(ii) Transform the data $(y, X)$ to $(R^{-1}y, R^{-1}X) = (z, X_R)$.

(iii) Regress $z$ on the columns of $X_R$ using OLS to obtain $\hat{\beta}_W$.

(iv) Since $X^T W^{-1} X = X^T (RR^T)^{-1} X = X^T (R^{-1})^T R^{-1} X = X_R^T X_R$, $\Sigma(\hat{\beta}_W) =
(X_R^T X_R)^{-1}$ so that $\mathrm{Var}(\hat{\beta}_{W,i})$ is the $i$-th diagonal element $b_i$ of $(X_R^T X_R)^{-1}$.

(v) When the errors are normal $(\hat{\beta}_{W,i} - \beta_i)/\sqrt{b_i}$ is $N(0, 1)$ so that $(1 - \alpha) \times 100\%$
confidence intervals for $\beta_i$ are obtained from

$$\hat{\beta}_{W,i} \pm z_{\alpha/2}\sqrt{b_i}, \quad 0 \le i \le m \qquad (6.193)$$

and these may be used in the usual way to conduct hypothesis tests concerning the
coefficients $\beta_i$.

(vi) Tests of the general linear hypothesis $C\beta = b$ when $\varepsilon \sim N(0, W)$ can be carried
out using the "extra sum of squares principle" on the transformed model (6.121). In
this case, since $\sigma^2 = 1$ is known, $\Delta SSE$ will have a $\chi^2(r)$ distribution when $H_0$ is
true, where $r = \mathrm{rank}(C)$. Thus $H_0: C\beta = b$ is rejected at level $\alpha$ if $\Delta SSE > \chi^2_{r,\alpha}$.


In particular, one can test for the overall significance of the regression by fitting the
reduced model

$$Z_i = x_{R,i0}\,\beta_0 + \delta_i, \quad 1 \le i \le n \qquad (6.194)$$

where $x_{R,i0}$ is the $i0$-th element of $X_R$, forming the residual sum of squares and
subtracting the SSE from fitting the full model (6.182). The hypothesis $H_0: \beta_1 = \beta_2 = \cdots =
\beta_m = 0$ is rejected at level $\alpha$ if $\Delta SSE > \chi^2_{m,\alpha}$.
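
Steps (i) to (iv) of the algorithm translate almost line by line into NumPy. The following is a minimal sketch under the assumption that $W$ is known and positive definite; the function name `gls` is hypothetical, and triangular solves are used in place of forming $R^{-1}$ explicitly.

```python
import numpy as np

def gls(X, y, W):
    """Generalized least squares via the Cholesky factor of W.

    Returns the estimate, its covariance matrix (X_R^T X_R)^{-1},
    and the SSE of the transformed model.
    """
    R = np.linalg.cholesky(W)                 # step (i): W = R R^T, R lower triangular
    z = np.linalg.solve(R, y)                 # step (ii): z = R^{-1} y
    XR = np.linalg.solve(R, X)                # step (ii): X_R = R^{-1} X
    beta, *_ = np.linalg.lstsq(XR, z, rcond=None)   # step (iii): OLS of z on X_R
    cov = np.linalg.inv(XR.T @ XR)            # step (iv): (X^T W^{-1} X)^{-1}
    sse = float(np.sum((z - XR @ beta) ** 2))
    return beta, cov, sse
```

The square roots of the diagonal of `cov` are the quantities $\sqrt{b_i}$ used in the confidence intervals (6.193), and the difference between the `sse` values of a reduced and a full fit is the $\Delta SSE$ of step (vi).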
Since the transformed model (6.121) does not have an intercept, the same problem
arises as to an appropriate goodness of fit measure as for WLS. A choice consistent
with those given previously is to use the squared sample correlation coefficient of $z$ and
$\hat{z} = X_R\hat{\beta}_W$ in (6.189). Another choice, given by Buse, can be defined as follows.

Let

$$\bar{y}_W = \left(\mathbf{1}, W^{-1}y\right) / \left(\mathbf{1}, W^{-1}\mathbf{1}\right) \qquad (6.195)$$

where $\mathbf{1} = (1, 1, \ldots, 1)^T$ is $n \times 1$ and $\bar{y}_W$ denotes the weighted mean of the observations
$y_i$, $1 \le i \le n$. (It is easily shown that $\bar{y}_W$ is the generalized least squares estimator in
the reduced model $Y_i = \beta_0 + \varepsilon_i$, $1 \le i \le n$, $V(\varepsilon) = W$.) Then it can be shown that the
following decomposition holds: letting $\hat{y} = X\hat{\beta}_W$,

$$\underbrace{\left(y - \bar{y}_W\mathbf{1}, W^{-1}(y - \bar{y}_W\mathbf{1})\right)}_{SST} = \underbrace{\left(\hat{y} - \bar{y}_W\mathbf{1}, W^{-1}(\hat{y} - \bar{y}_W\mathbf{1})\right)}_{SSR} + \underbrace{\left(y - \hat{y}, W^{-1}(y - \hat{y})\right)}_{SSE}. \qquad (6.196)$$

Since each of the quadratic forms in (6.196) is positive definite, we can define

$$R_W^2 = SSR/SST \qquad (6.197)$$

which has properties analogous to the usual $R^2$ and reduces to it when $W = \sigma^2 I_n$.

When $W$ is unknown, as we have already seen for WLS, the problem of estimating $\beta$ is
generally intractable unless further assumptions are made concerning the error structure.
If $W$ can be parameterized by a small number of parameters, then maximum likelihood
estimation may be possible as well as generalizations of iteratively re-weighted least
squares. These topics are dealt with at length in the literature on time series analysis
[11] and kriging [23] and will not be further dealt with in this text.


6.8 Exercises 


6.1 [90] An experiment was conducted to gain some preliminary insight into the effect
of three quantitative factors on the capability of a particular coal-cleansing operation.
A polymer was used to clean the coal and the amount of suspended solids, $y$, was measured (mg/l).
The factors that influenced the suspended solids are


$x_1$: percentage of solids in the input solution
$x_2$: pH of the tank that holds the solution
$x_3$: flow rate of the cleansing polymer, ml/minute


Assume that all three factors were controlled in the experimental process. The
data are given in Table 6.21.


For the multiple regression model containing all three regressors, calculate the 
following influence diagnostics. 


(a) Residuals $e_i$ and $r_i$, and make plots against fitted values.
(b) Leverages - HAT diagonals $h_{ii}$.

(c) Cook's $D_i$.

(d) $DFFITS_i$, and the cut-off point.

(e) $DFBETAS_{j,i}$ $(j = 1, 2, 3)$, and the cut-off point.
Table 6.21 Coal-cleansing Data
No.   x1    x2    x3     y
1     1.5   6.0   1315   243
2     1.5   6.0   1315   261
3     1.5   9.0   1890   244
4     1.5   9.0   1890   285
5     2.0   7.5   1575   202
6     2.0   7.5   1575   180
7     2.0   7.5   1575   188
8     2.0   7.5   1575   207
9     2.5   9.0   1315   216
10    2.5   9.0   1315   160
11    2.5   6.0   1890   104
12    2.5   6.0   1890   110


6.2 Show that $\hat{\beta}_W$, the generalized least squares estimator given by (6.188), is unbiased
for $\beta$.


6.3 Show that the vector $\hat{\beta}_W$ which minimizes

$$(y - X\beta)^T W^{-1} (y - X\beta)$$

is given by the generalized least squares estimator given in (6.188).


6.4 An experiment was conducted to study the growth characteristics of corn roots 
in the presence of a particular herbicide that was applied to a certain type of soil. 
Data were collected and percent of control, meaning the percent of the growth 
observed without the herbicide, was used as the response [90]. 


Obs.   Concentration of   Percentage of
No.    Herbicide (x)      Control (y)

1      0.5                95.8467
2      1.0                91.6561
3      2.0                81.5142
4      8.0                75.7477
5      32.0               68.7061
6      128.0              35.9895


(a) Plot the data and give a comment. 


(b) Using the transformation $\log(y_i)$, fit the data to the model
$\log(y_i) = \beta_0 + \beta_1 x_i + \varepsilon_i$.


(c) Find s? for the model in (b). 


(d) Compute the original residuals, the PRESS residuals, and the sum of the ab- 
solute PRESS residuals. 
6.5 For the regression models shown below, determine whether it is a linear model,
an intrinsically linear model, or an intrinsically nonlinear model. If the model is
intrinsically linear, suggest how it can be linearized by a suitable transformation.


(a) y = fefe 

(b) $y = \beta_1 \exp(\beta_2 + \beta_3 x) + \varepsilon$

(c) $y = \beta_1 + \beta_2 \exp(\beta_3 x) + \varepsilon$

(d) y= 61 + (62/01) +€ 

(e) y = 01 + 0521 + 02 cJ +e 

6.6 Using the two models in (6.108) and (6.109) for the tree data in Table 6.7, make plots of
the studentized residuals $r_i$ against the fitted values $\hat{D}$ and $\hat{H}$ respectively. Do you
observe any outlier(s)?


6.7 What happens to the weighted average

$$\bar{x}_W = \frac{w_1^2 x_1 + w_2^2 x_2}{w_1^2 + w_2^2} \qquad (6.198)$$

if the first weight $w_1$ approaches zero? The measurement $x_1$ is totally unreliable.


6.8 Suppose that you have $n$ independent measurements $x_1, x_2, \ldots, x_n$ of your pulse
rate, weighted by $w_1, w_2, \ldots, w_n$. What is the weighted average that replaces (6.198)?
It is the best estimate when the statistical variances are $\sigma_i^2 = 1/w_i^2$.


6.9 (Cauchy-Schwarz inequality) If $X$ and $Y$ have means $\mu_X$, $\mu_Y$ and variances $\sigma_X^2$, $\sigma_Y^2$,
respectively, prove

$$[E(XY)]^2 \le E(X^2)\,E(Y^2).$$


6.10 Consider the first-order autoregressive model for a sample of size $n = 32$:

$$Y_t = \beta_0 + \beta_1 x_{t1} + \beta_2 x_{t2} + \beta_3 x_{t3} + \varepsilon_t$$

where $\varepsilon_t = \rho\varepsilon_{t-1} + u_t$ with $|\rho| < 1$ and the $u_t$ are independent $N(0, \sigma^2)$.
(a) Explain the procedure for testing $H_0: \rho = 0$ versus $H_1: \rho > 0$ at $\alpha = 0.05$.
(b) Explain the procedure for testing $H_0: \rho = 0$ versus $H_1: \rho \ne 0$ at $\alpha = 0.01$.


6.11 A study was conducted by McNamara and Browne [80] to examine the relationship
between the prices of gold and "black gold" over a period of time. Table
6.22 presents quarterly data from March 1976 to March 1980 on the price of gold
($/ounce) and the price of petroleum ($/barrel). Take the price of gold as the
response variable $Y$ and the price of petroleum as the explanatory variable $x$.


(a) Fit a simple linear regression model to these data. 


(b) Determine whether or not the assumption of independence of residuals in the 
OLS model appears to have been violated. 


(1) by plotting the residuals against time. 
(2) by calculating Durbin-Watson test. 
(c) If necessary, correct the autocorrelation using the Cochrane-Orcutt method.


Table 6.22 Petroleum and Gold Prices in United States 
Year Month Petroleum Year Month Petroleum 


6.12 Find the weighted least squares solution $\hat{x}_W$ to $Ax = b$:


1 0 2.0 0 
1 2 1 0 0 1 
Check that the projection $A\hat{x}_W$ is still perpendicular (in the $W$-inner product!) to
the error $b - A\hat{x}_W$.


6.13 Consider the simple linear regression model

$$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where the $\varepsilon_i$ are independent with $E(\varepsilon_i) = 0$ and $\mathrm{Var}(\varepsilon_i) = x_i^2\sigma^2$, $i = 1, 2, \ldots, n$.


(a) Show that weighted least squares estimation (WLS) is equivalent to ordinary
least squares estimation (OLS) for the model [104]

$$\frac{Y_i}{x_i} = \beta_0 \frac{1}{x_i} + \beta_1 + \delta_i.$$

(b) Under the normality assumption of errors $\delta_i$ in the model in (a), find the MLEs
of $\beta_1$, $\beta_0$.


6.14 Use the Cochrane-Orcutt procedure to correct for the autocorrelation in the 
Longley data in Example 6.3. 
Chapter 7


Further Applications of 
Regression Techniques 


7.1 Introduction 


In this chapter we will expand on a number of regression modelling techniques that 
were discussed briefly in Chapters 5 and 6. These will include a further discussion of 
polynomial and piecewise polynomial models in one and several variables, the further 
use of dummy (indicator) variables to deal with qualitative factors in modelling and last 
a further discussion of binary response models and logistic regression. These techniques 
allow one to model a wide variety of phenomena in science and technology. 


7.2 Polynomial Models in One Variable 


As we indicated in Chapter 5, by choosing $x_j = x^j$, $0 \le j \le m$, in the GLM the model
(5.1) becomes

$$Y = \beta_0 + \sum_{j=1}^{m} \beta_j x^j + \varepsilon \qquad (7.1)$$

which we refer to as a polynomial model of degree $m$. Here $E(Y)$ depends nonlinearly
on $x$ but is a linear model since it depends linearly on the parameters $\beta_j$, $0 \le j \le m$.

Generally a polynomial model is suggested if a scatter plot of $(x, y)$ shows substantial
curvature or a residual plot from a linear fit in $x$ shows curvature as well. When the
behavior in $x$ appears monotone, then variable transformations as discussed in Chapters
3 and 6 are often appropriate. However, if the curve is not monotone such as that shown
in Figure 7.1, then a polynomial model may be more appropriate. For modelling over
large ranges of $x$, piecewise polynomial (spline) models have many advantages. These
will be discussed in the following section.

Although fitting a polynomial appears to be a straightforward application of the
GLM, there are a number of pitfalls that one should be aware of in using a polynomial
model. First, the normal equations arising from (7.1) can be highly ill-conditioned, even
for low order polynomials, particularly if the observations are taken over a small range
of $x$. This can lead to substantial round-off errors which one should be aware of when
using "off-the-shelf" software. Fortunately, there are a number of standard certified data
sets which are available (they can be found on the Internet) to test the accuracy of one's
software.


Figure 7.1: A non-monotonic curve

Second, there is the issue of choosing the degree $m$ of the polynomial in (7.1). Since
a polynomial of degree $m$ can be fitted exactly to $m + 1$ data points, one needs to be
cautious of overfitting the data. Generally, one should use as low an order polynomial as
possible to obtain a satisfactory fit. Models which are overfit will not be useful for
prediction.

To deal with the ill-conditioning, a widely recommended remedy is to use centered
variables

$$z_j = (x - \bar{x})^j \qquad (7.2)$$

where $\bar{x}$ is the mean of the independent variable. One then fits the model

$$Y = \gamma_0 + \sum_{j=1}^{m} \gamma_j z_j + \varepsilon \qquad (7.3)$$

rather than (7.1). By expanding $z_j$ using the binomial theorem one can then relate $\gamma_j$
to $\beta_j$, $0 \le j \le m$, in (7.1). Another, more sophisticated approach is to use orthogonal
polynomials which will be discussed in this Section.
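
The effect of centering on the conditioning of the normal equations is easy to see numerically. The sketch below uses a hypothetical set of $x$ values taken over a narrow range far from zero (an assumption chosen to make the effect visible) and compares the condition number of $X^T X$ for the raw powers in (7.1) with that for the centered powers in (7.2).

```python
import numpy as np

x = np.linspace(100.0, 101.0, 21)    # hypothetical data: small range, far from 0
m = 3                                # degree of the polynomial

# Design matrices with columns 1, x, ..., x^m and 1, z_1, ..., z_m.
X_raw = np.vander(x, m + 1, increasing=True)
X_cen = np.vander(x - x.mean(), m + 1, increasing=True)

cond_raw = np.linalg.cond(X_raw.T @ X_raw)   # normal equations for (7.1)
cond_cen = np.linalg.cond(X_cen.T @ X_cen)   # normal equations for (7.3)
print(f"raw: {cond_raw:.3e}   centered: {cond_cen:.3e}")
```

Centering removes most, though not all, of the ill-conditioning; the remaining correlation between even powers is what orthogonal polynomials are designed to eliminate.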

Choosing the degree of the polynomial is a more difficult problem, characteristic of 
all model building. Too low a degree may give a poor fit, too high a degree may give a 
good fit, but be unreliable for prediction. Anticipating our discussion of subset selection 
in Chapter 8, we examine two approaches - forward selection and backward elimination. 

In forward selection, one can start with a linear (in $x$) fit and examine the summary
statistics and residual plots. If the linear coefficient is significant, $R^2$ is large and residual plots show no
evidence of curvature, we might stop. If either $R^2$ is small or residual plots of $\hat{e}$ against
$x$ or against $\hat{y}$ show curvature, then one might add a quadratic term. This process can
be repeated, until an adequate fit is obtained with significant coefficients, large $R^2$ and
appropriate residuals.

This approach has its pitfalls, because one might stop too soon. For example, if the
true model is

$$Y = \beta_0 + \beta_1 x + \beta_3 x^3 + \varepsilon \qquad (7.4)$$

then the procedure outlined above might suggest that a quadratic term is not present,
since a $t$ test might indicate that $\beta_2 = 0$, and one might stop adding variables, even
though the model contains a cubic term. Of course, if $R^2$ is still small and residual plots
show distinct curvature one might consider higher order terms.

In backward elimination one starts with a polynomial of sufficiently high degree less
than the number of data points and deletes terms whose coefficients have small $t$ values.
Again, because of multicollinearity one must be careful in just eliminating variables
"en masse". Elimination of one variable at a time is generally a preferred approach. If one
uses orthogonal polynomials, then these can be entered in any order and provide a less
ambiguous choice of model.


7.2.1 Orthogonal Polynomials 


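As we have noted, even though the ill-conditioning can be alleviated by centering, there
may still exist a high level of multicollinearity which usually causes computational
difficulties. These difficulties can be avoided using orthogonal polynomials. Orthogonal
polynomials are uncorrelated.

Suppose that the model is given in (7.1). Since the columns of the design matrix $X$
will not be orthogonal, we now consider the model

$$Y_i = \gamma_0\psi_0(x_i) + \gamma_1\psi_1(x_i) + \cdots + \gamma_m\psi_m(x_i) + \varepsilon_i \qquad (7.5)$$

where $\psi_r(x_i)$ is an $r$-th degree polynomial $(r = 1, 2, \ldots, m)$ in the $x_i$'s $(i = 1, 2, \ldots, n)$ such
that

$$\sum_{i=1}^{n} \psi_r(x_i)\psi_s(x_i) = 0, \quad \text{for all } r, s,\ r \ne s \qquad (7.6)$$

and

$$\psi_0(x_i) = 1, \quad i = 1, 2, \ldots, n. \qquad (7.7)$$

Then the model in (7.5) becomes $Y = X\gamma + \varepsilon$, where

$$X = \begin{pmatrix}
\psi_0(x_1) & \psi_1(x_1) & \cdots & \psi_m(x_1) \\
\psi_0(x_2) & \psi_1(x_2) & \cdots & \psi_m(x_2) \\
\vdots & \vdots & & \vdots \\
\psi_0(x_n) & \psi_1(x_n) & \cdots & \psi_m(x_n)
\end{pmatrix}.$$

Since the columns of the $X$ matrix are orthogonal,

$$X^T X = \begin{pmatrix}
\sum_{i=1}^{n}\psi_0^2(x_i) & 0 & \cdots & 0 \\
0 & \sum_{i=1}^{n}\psi_1^2(x_i) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \sum_{i=1}^{n}\psi_m^2(x_i)
\end{pmatrix}$$

and the least squares estimator is given by $\hat{\gamma} = (X^T X)^{-1} X^T Y$, we have

$$\hat{\gamma}_j = \frac{\sum_{i=1}^{n}\psi_j(x_i) Y_i}{\sum_{i=1}^{n}\psi_j^2(x_i)}, \quad j = 0, 1, 2, \ldots, m. \qquad (7.9)$$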
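Since $\psi_0(x_i) = 1$, it follows from (7.9) that

$$\hat{\gamma}_0 = \frac{\sum_{i=1}^{n} Y_i}{n} = \bar{Y}. \qquad (7.10)$$

The residual sum of squares is then

$$\begin{aligned}
SSE &= (Y - X\hat{\gamma})^T (Y - X\hat{\gamma}) = Y^T Y - \hat{\gamma}^T X^T Y \\
&= \sum_{i=1}^{n} Y_i^2 - \sum_{j=0}^{m} \hat{\gamma}_j^2 \sum_{i=1}^{n}\psi_j^2(x_i) \\
&= \sum_{i=1}^{n} (Y_i - \bar{Y})^2 - \sum_{j=1}^{m} \hat{\gamma}_j^2 \left[\sum_{i=1}^{n}\psi_j^2(x_i)\right]. \qquad (7.11)
\end{aligned}$$

If one wishes to test $H_0: \gamma_m = 0$, which is in fact equivalent to testing $H_0: \beta_m = 0$
in (7.1), the residual sum of squares under the null hypothesis is

$$SSE_{H_0} = SSE + \hat{\gamma}_m^2 \sum_{i=1}^{n}\psi_m^2(x_i). \qquad (7.12)$$

Then, the test statistic would be

$$F_{1,\,n-m-1} = \frac{SSE_{H_0} - SSE}{SSE/(n - m - 1)} = \frac{\hat{\gamma}_m^2 \sum_{i=1}^{n}\psi_m^2(x_i)}{SSE/(n - m - 1)}. \qquad (7.13)$$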


The orthogonal polynomials $\psi_j(x_i)$ can be obtained in many different ways. In
particular, if the levels of $x$ are equally spaced, they can be easily constructed. A survey
of methods for generating orthogonal polynomials can be found in Seber [104]. We note
that the method of generating the $\psi_r$ is similar to Gram-Schmidt orthogonalization, with
the difference that only the preceding two polynomials are involved at each stage.

Some of these orthogonal polynomials are given in Table 7.1. 


Table 7.1 Coefficients of Orthogonal Polynomials 


с CC URN S SN 
91 V3 | Yı Pa Vs 


For the case n > 7 readers should refer to [24, 94, 27]. 
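
When tables are not at hand, orthogonal polynomial values satisfying (7.6) and (7.7) can be generated numerically. The sketch below is one possible construction, not the recurrence described above: it orthogonalizes the columns $1, x, \ldots, x^m$ with a QR factorization (equivalent to Gram-Schmidt on the monomials) and rescales so that $\psi_0(x_i) = 1$. The data are simulated and purely hypothetical. The coefficients are then obtained from (7.9) one at a time, without solving a linear system.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(1.0, 8.0)                       # n = 7 equally spaced levels
y = 1.0 + 0.5 * x - 0.2 * x**2 + rng.normal(scale=0.1, size=x.size)
m = 2                                         # degree of the polynomial

V = np.vander(x, m + 1, increasing=True)      # columns 1, x, ..., x^m
Q, _ = np.linalg.qr(V)                        # orthonormal columns; column j has degree j
Psi = Q / Q[0, 0]                             # common rescaling so that psi_0(x_i) = 1

# Equation (7.9): each coefficient is a ratio of two sums.
gamma = (Psi.T @ y) / np.sum(Psi ** 2, axis=0)
fitted = Psi @ gamma
print(np.round(gamma, 4))
```

Because the columns are orthogonal, dropping or adding the highest-degree term leaves the remaining $\hat{\gamma}_j$ unchanged, which is what makes the sequential tests based on (7.13) convenient.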
7.2.2 Piecewise Polynomial Models


Although polynomial models are conceptually quite simple as a way of dealing with
nonlinear trends in a predictor variable $x$, there are some difficulties, particularly if the range
of the data is large. In this case a piecewise polynomial model can be computationally
and conceptually easier to use. Such problems often occur in economic data where the
independent variable represents time and various time trends may be present.


Example 7.1 Consider the data in Table 7.2 below.


Table 7.2 A Data
x   1     2     3     4     5      6      7      8      9
y   2.3   3.8   6.5   7.4   10.2   10.5   12.1   13.2   13.6


Graphing this data in Figure 7.2 indicates that the first five data points lie on one 
line while the final four appear to lie on another. 


Figure 7.2: Scatter plot of data in Example 7.1 


If this is the case, then we can perform two regressions - one for the first five points
and another for the last four - and be done. However, one of the things we might like to
know is whether our visual impression is correct. For example, do the slopes of the two
lines really differ? Another reasonable question to ask is whether $x = 5$ is the abscissa of
the point of intersection of the two lines. That is, is the true model of the form in Figure
7.3 or Figure 7.4? As we shall see, an appropriate choice of multiple regression models
allows one to make such inferences in a straightforward way.

Historically, such problems appear to have been treated using dummy variables, such
as the approach given in Draper and Smith [27]. More recently the use of spline functions
has been advocated [122, 106, 117] and this is the approach we follow.


Figure 7.3: The true model A of the data


Figure 7.4: The true model B of the data


Consider, for some number $x_0$, the two functions

$$(x - x_0)_+^0 = \begin{cases} 0, & x \le x_0, \\ 1, & x > x_0, \end{cases} \qquad (7.14)$$

and

$$(x - x_0)_+ = \begin{cases} 0, & x \le x_0, \\ x - x_0, & x > x_0. \end{cases} \qquad (7.15)$$

To model two straight lines simultaneously, we consider the linear model

$$E(Y) = \beta_0 + \beta_1(x - x_0) + \beta_2(x - x_0)_+^0 + \beta_3(x - x_0)_+. \qquad (7.16)$$


To see that (7.16) actually represents two lines, we consider $x \le x_0$ and $x > x_0$
separately. Now if $x \le x_0$, then

$$E(Y) = \beta_0 + \beta_1(x - x_0) = \beta_0 - \beta_1 x_0 + \beta_1 x \qquad (7.17)$$

and this is a straight line with slope $\beta_1$ and intercept $\beta_0 - \beta_1 x_0$. If $x > x_0$, then

$$\begin{aligned}
E(Y) &= \beta_0 + \beta_1(x - x_0) + \beta_2 + \beta_3(x - x_0) \\
&= \beta_0 - \beta_1 x_0 - \beta_3 x_0 + \beta_2 + (\beta_1 + \beta_3)x \qquad (7.18)
\end{aligned}$$

and this is a line with slope $\beta_1 + \beta_3$ and intercept $\beta_0 - \beta_1 x_0 - \beta_3 x_0 + \beta_2$. We also note
that at $x = x_0$ the value of $E(Y)$ on the first line is $\beta_0$ while the value of $E(Y)$ on the
second line is $\beta_0 + \beta_2$. Thus, we can test whether the slopes are equal by testing whether
$\beta_3 = 0$, while the assumption that $x_0$ is the abscissa of the point of intersection may be
checked by testing whether $\beta_2 = 0$. (We note that $\beta_2$ is the difference in the vertical
heights of the lines at $x = x_0$.) To illustrate some of these ideas numerically we consider
the data given in Table 7.3.
In this case the proposed model is

$$Y = \beta_0 + \beta_1(x - 5) + \beta_2(x - 5)_+^0 + \beta_3(x - 5)_+ + \varepsilon. \qquad (7.19)$$

Using (7.14) and (7.15) we arrive at the following table of data to perform the regression.
Table 7.3 Some data


y      x0   x1   x2   x3
2.3    1    -4   0    0
3.8    1    -3   0    0
6.5    1    -2   0    0
7.4    1    -1   0    0
10.2   1    0    0    0
10.5   1    1    1    1
12.1   1    2    1    2
13.2   1    3    1    3
13.6   1    4    1    4


where $x_1 = (x - 5)$, $x_2 = (x - 5)_+^0$ and $x_3 = (x - 5)_+$. Fitting this data by least squares
we obtain

$$\hat{y} = 9.92 + 1.94(x - 5) - 0.17(x - 5)_+^0 - 0.9(x - 5)_+ \qquad (7.20)$$

as the estimated model.
The $t$ values are: $t_0 = 27.33$; $t_1 = 13.09$; $t_2 = -0.25$; $t_3 = -3.51$, and $R^2 = 0.9917$.
Since there are 9 observations and 4 parameters there are $9 - 4 = 5$ degrees of freedom
and this is significant at the 1% level since $F_{3,5,0.01} = 12.1$. Thus we can conclude that
the overall fit is significant.
To test the assumption that there are two distinct lines we test

$$H_0: \beta_3 = 0 \text{ against } H_1: \beta_3 \ne 0. \qquad (7.22)$$

Since $t_3 = -3.51$ and $t_{5,0.025} = 2.571$, $H_0$ can be rejected at the 5% level and it is quite
reasonable to assume that the data is represented by two lines, rather than one.
To check whether $x = 5$ is the abscissa of the point of intersection we test

$$H_0: \beta_2 = 0 \text{ against } H_1: \beta_2 \ne 0. \qquad (7.23)$$

Since $t_2 = -0.25$ it is reasonable to conclude that $\beta_2 = 0$ and therefore that the true
model is of the form

$$Y = \beta_0 + \beta_1(x - 5) + \beta_3(x - 5)_+ + \varepsilon. \qquad (7.24)$$

We leave it as an exercise to refit the model under the assumption that the abscissa
of the point of intersection is $x_0 = 5$.
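
The fit in Example 7.1 can be reproduced with a few lines of NumPy. This is a sketch of the computation, assuming the functions (7.14) and (7.15) with $x_0 = 5$; the variable names `step` and `hinge` are hypothetical labels for $(x - 5)_+^0$ and $(x - 5)_+$.

```python
import numpy as np

x = np.arange(1.0, 10.0)                                  # x = 1, ..., 9
y = np.array([2.3, 3.8, 6.5, 7.4, 10.2, 10.5, 12.1, 13.2, 13.6])
x0 = 5.0

step = (x > x0).astype(float)              # (x - x0)^0_+ : 0 for x <= x0, 1 for x > x0
hinge = np.where(x > x0, x - x0, 0.0)      # (x - x0)_+   : 0 for x <= x0, x - x0 otherwise
X = np.column_stack([np.ones_like(x), x - x0, step, hinge])

beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (len(y) - X.shape[1])           # residual variance on 9 - 4 = 5 df
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))   # standard errors of the coefficients
t = beta / se                                        # t values for beta_0, ..., beta_3
print(np.round(beta, 2), np.round(t, 2))
```

The printed coefficients and $t$ values can be compared with those quoted in the example; the last two entries of `t` are the statistics used in the tests (7.23) and (7.22) respectively.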


Example 7.2 As a second example, consider the data set shown in Table 7.4 and
plotted in Figure 7.5.


Table 7.4 Some data
x   1     2     3     4     5     6      7      8      9
y   1.8   4.3   5.6   8.2   9.1   10.7   11.5   12.2   14.0


Figure 7.5: Scatter plot of x and y


Here it appears that the first 4 points lie on one line while the last five points appear
to lie on another. Thus we assume a model of the form

$$Y = \beta_0 + \beta_1(x - 4) + \beta_2(x - 4)_+^0 + \beta_3(x - 4)_+ + \varepsilon. \qquad (7.25)$$

The data for this model are shown in Table 7.5.
Table 7.5 Data for Example 7.2


y      x0   x1   x2   x3
1.8    1    -3   0    0
4.3    1    -2   0    0
5.6    1    -1   0    0
8.2    1    0    0    0
9.1    1    1    1    1
10.7   1    2    1    2
11.5   1    3    1    3
12.2   1    4    1    4
14.0   1    5    1    5


Fitting this model by least squares gives:

$\hat{\beta}_0 = 8.05$, $\hat{\beta}_1 = 2.05$, $\hat{\beta}_2 = 0.06$, $\hat{\beta}_3 = -0.92$,
$t_0 = 25.98$, $t_1 = 12.38$, $t_2 = 0.12$, $t_3 = -4.53$,
$R^2 = 0.995$, and $F = 311.94$.


Thus the overall fit is significant at the 1% level and we conclude that $\beta_0 \ne 0$ at the
1% level of significance. From our previous discussion, the estimated slope of the first
line is $\hat{\beta}_1 = 2.05$ while that of the second line is $\hat{\beta}_1 + \hat{\beta}_3 = 2.05 - 0.92 = 1.13$. Thus the
estimated equation of the first line is

$$\hat{y}_1 = -0.15 + 2.05x,$$

while that of the second line is

$$\hat{y}_2 = 3.59 + 1.13x.$$

The abscissa of the point of intersection of the two lines is obtained by setting $\hat{y}_1 = \hat{y}_2$
and this gives

$$3.59 + 1.13x = -0.15 + 2.05x \quad \text{or} \quad 0.92x = 3.74$$

which yields $x = 4.065$.


The approach we have described for fitting a piecewise linear curve can be extended
to fit more complicated piecewise polynomial models. For example, to fit data described
by three lines over the intervals $(-\infty, x_0)$, $[x_0, x_1)$, $[x_1, \infty)$ we can use the linear model

$$\begin{aligned}
Y = {} & \beta_0 + \beta_1(x - x_0) + \beta_2(x - x_0)_+^0 + \beta_3(x - x_0)_+ \\
& + \beta_4(x - x_1)_+^0 + \beta_5(x - x_1)_+ + \varepsilon. \qquad (7.26)
\end{aligned}$$

Equality of slopes may be checked by testing

$$H_0: \beta_3 = \beta_5 = 0 \qquad (7.27)$$

against

$$H_1: \beta_3 \ne 0 \text{ or } \beta_5 \ne 0 \qquad (7.28)$$
and this may be done by using the appropriate F test as described in Chapter 5.
A piecewise quadratic model may be obtained by fitting


Y = 8+8, (z — zo) + B (z — zo). Bs (x — zo); 
+84 (= — 21), + 85 (z—2i) Te. (7.29) 
where

$$(x - x_0)_+^2 = \begin{cases} 0, & x \le x_0, \\ (x - x_0)^2, & x > x_0. \end{cases} \qquad (7.30)$$

In this form continuity of $y$ at $x = x_0$ can be checked by testing
Но: 84 = 0 against H; : 84 #0 (7.31) 
while differentiability at $x = x_0$ may be checked by testing
Но: В, = 0 against Hy : 84 z 0. (7.32) 


We leave the verification of these as exercises for the reader to check. 

More complicated piecewise polynomial models may be developed along the lines we
have indicated. For more details, we refer the reader to Refs. [87, 117].

One last comment should be made concerning the fitting of piecewise polynomial
models. In both of the examples given in this section, we assumed that we
knew which points lay on which lines. In general, this will probably not be the case, so
that we will have to estimate $x_0$ as well. One possible approach is to check every possible
division of points into two lines. For example, in Example 7.2 we can take $x_0 = 1, 2, \ldots, 9$
and then fit the model given by Eq. (7.25) for each possible choice of $x_0$ and then choose
that value of $x_0$ which gives the largest $R^2$ (equivalently smallest SSE). For further details
see Ref. [87].
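
The search over candidate break points can be sketched as follows for the data of Example 7.2. This is only an illustration under an added assumption: the candidates are restricted to values that leave at least three points on each side, because a segment containing only one or two points is fitted exactly and would make the comparison of SSE values uninformative. The function name `piecewise_sse` is hypothetical.

```python
import numpy as np

x = np.arange(1.0, 10.0)                                  # x = 1, ..., 9
y = np.array([1.8, 4.3, 5.6, 8.2, 9.1, 10.7, 11.5, 12.2, 14.0])

def piecewise_sse(x0):
    """SSE of the two-line model with break point x0, as in Eq. (7.25)."""
    step = (x > x0).astype(float)              # (x - x0)^0_+
    hinge = np.where(x > x0, x - x0, 0.0)      # (x - x0)_+
    X = np.column_stack([np.ones_like(x), x - x0, step, hinge])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return float(resid @ resid)

# Assumed restriction: at least three observations on each side of x0.
candidates = [3.0, 4.0, 5.0, 6.0]
sse = {x0: piecewise_sse(x0) for x0 in candidates}
best_x0 = min(sse, key=sse.get)                # smallest SSE, i.e. largest R^2
print(sse, best_x0)
```

Since the total sum of squares does not depend on $x_0$, ranking the candidates by SSE is the same as ranking them by $R^2$. The SSE values for neighbouring break points can be close, so the selected $x_0$ should be read together with a plot of the data.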

Another possibility is to consider $x_0$ to be an unknown parameter and then one can
estimate $x_0$ along with the coefficients. This is a nonlinear regression problem - but
is complicated by the fact that the mean function is not differentiable at $x_0$, so
traditional numerical methods for minimization based on calculus techniques are generally
not applicable.


7.2.3 Multivariate Polynomial Models 


If we consider a polynomial regression model with two or more regressor variables,
the approach will be a straightforward extension from the method of fitting polynomial
models in one variable. Suppose that the postulated model is a second-order polynomial
model in two variables. Then the model would be

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \varepsilon \qquad (7.33)$$

where $\beta_1$ and $\beta_2$ are the two linear effects, $\beta_{11}$ and $\beta_{22}$ are the quadratic effects of $x_1$
and $x_2$ respectively, and $\beta_{12}$ indicates the parameter of an interaction effect. Then the
regression function (or response function)

$$E(Y) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 \qquad (7.34)$$

is called a response surface. This type of modeling is also called response surface
methodology, which is widely applicable to model output processes in industrial or engineering
areas for finding the operating conditions that optimize a response. For more details and
examples, see [10, 88].
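
Because (7.33) is linear in its parameters, fitting it only requires building the six columns of the design matrix. The sketch below uses simulated data with hypothetical coefficient values, chosen only to show that ordinary least squares recovers them; the function name `second_order_design` is likewise an assumption.

```python
import numpy as np

def second_order_design(x1, x2):
    """Columns 1, x1, x2, x1^2, x2^2, x1*x2 of the second-order model (7.33)."""
    return np.column_stack([np.ones_like(x1), x1, x2, x1**2, x2**2, x1 * x2])

rng = np.random.default_rng(0)
x1 = rng.uniform(-1.0, 1.0, 50)               # hypothetical coded factor levels
x2 = rng.uniform(-1.0, 1.0, 50)
beta_true = np.array([1.0, 2.0, -1.0, 0.5, -0.3, 0.8])   # illustrative values only

X = second_order_design(x1, x2)
y = X @ beta_true + rng.normal(scale=0.01, size=50)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.round(beta_hat, 3))
```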


7.3 Radial Basis Functions 


Although most regression models used in practice are parametric, i.e., the parameter
vector has a direct physical interpretation, there are many situations where they do not.
Such models are often referred to as nonparametric. Typical examples are polynomial
and spline models. Even though splines are quite useful for fitting global data, they can
be clumsy to use for higher dimensional data. Similarly, polynomials can cause difficulties
if the range of the data is large and the dimensionality is high. To remedy some of these
problems, mathematicians have been investigating a new class of functions, radial basis
functions (rbfs). Although much of this work has been devoted to interpolating
non-noisy data, increasingly they have been used to fit noisy data sets by statistical, often
least squares, methods.

Typical applications include approximation of spatial data as occurs in mining
studies, particularly in the well-known statistical technique of kriging [23], environmental
data, medical data and neural network modeling.

To further elaborate on these functions we begin with a definition. 


Definition 7.1 Let $\phi : [0, \infty) \to \mathbb{R}$ be a continuous real-valued function. Let $\|\cdot\|$
denote the Euclidean norm on $\mathbb{R}^n$ and let $\{P_j\}_{j=1}^{N}$ be a set of $N$ distinct points in $\mathbb{R}^n$.
A radial basis function is a function of the form

$$f(P) = \sum_{j=1}^{N} \beta_j \, \phi(\|P - P_j\|), \qquad P \in \mathbb{R}^n \tag{7.35}$$

where $\{\beta_j\}_{j=1}^{N}$ are unknown coefficients. These coefficients are generally determined
from a given set of values of $f$ at a possibly distinct set of points $\{Q_k\}_{k=1}^{M}$, $M \ge N$.
More generally, one often adds a polynomial term $p_m$ of degree $m$ to (7.35) so that
the most general rbf is given by

$$f(P) = \sum_{j=1}^{N} \beta_j \, \phi(\|P - P_j\|) + p_m(P). \tag{7.36}$$

Then the coefficients of $p_m$ have to be determined simultaneously with $\{\beta_j\}_{j=1}^{N}$.

Although, by definition, there is an infinite number of possibilities for $\phi$, over the years
a relatively small number of $\phi$'s have emerged as being particularly useful.


7.3.1 Types of Radial Basis Functions 


Let $r = \|P\|$; then typical radial basis functions are of the following forms:

(i) $\phi(r) = \exp(-c r^2)$, $c > 0$, $P \in \mathbb{R}^n$.
These functions are referred to as Gaussian radial basis functions. It is known that
Gaussian rbfs have optimal convergence properties and have found widespread use
in neural network modeling [95, 100]. The parameter $c$ is usually called the shape
or variance parameter. Generally its value is unknown and its proper choice can
have a substantial effect on the quality of the fit.


(ii) $\phi(r) = \sqrt{r^2 + c^2}$, $P \in \mathbb{R}^n$.
Here $\phi(r)$ is called a multiquadric and $c$ is the shape parameter. Again,
multiquadrics have optimal convergence properties, with the shape parameter having a
marked effect on the convergence rate; its proper choice has been a topic of
continuing interest. Multiquadrics have found considerable use in fitting spatial
data [36, 78] and in solving partial differential equations [43]. Interestingly, they
were first discovered by Hardy [50] in the course of fitting geophysical data.


(iii) $\phi(r) = \begin{cases} r^2 \log r, & P \in \mathbb{R}^2, \\ r, & P \in \mathbb{R}^3. \end{cases}$
These functions are referred to as thin-plate splines (TPS) and are probably the
most widely used rbfs. They are generally considered to be the first rbfs to be
actively investigated and have found widespread use in fitting spatial data (see
in particular, Wahba [117]), fitting technical data and in the solution of partial
differential equations [42]. Their importance derives from the fact that they are
optimal interpolants and are the natural generalization of one dimensional cubic
splines [117].

In contrast to Gaussians and multiquadrics, which do not require the addition of
polynomial terms, TPS require the addition of a first degree polynomial

$$p_1(P) = a + bx + cy, \qquad P = (x, y) \in \mathbb{R}^2 \tag{7.37}$$

and

$$p_1(P) = a + bx + cy + dz, \qquad P = (x, y, z) \in \mathbb{R}^3. \tag{7.38}$$

In this case, in order to properly fit the data it is necessary to impose the constraints

$$\sum_{j=1}^{N} \beta_j = \sum_{j=1}^{N} \beta_j x_j = \sum_{j=1}^{N} \beta_j y_j = 0, \qquad (x_j, y_j) \in \mathbb{R}^2, \tag{7.39}$$

$$\sum_{j=1}^{N} \beta_j = \sum_{j=1}^{N} \beta_j x_j = \sum_{j=1}^{N} \beta_j y_j = \sum_{j=1}^{N} \beta_j z_j = 0, \qquad (x_j, y_j, z_j) \in \mathbb{R}^3. \tag{7.40}$$


(iv) $\phi(r) = \begin{cases} r^{2n} \log r, & P \in \mathbb{R}^2, \ n \ge 2, \\ r^{2n-1}, & P \in \mathbb{R}^3, \ n \ge 2. \end{cases}$
These rbfs are called higher-order polyharmonic or Duchon splines. They are a
direct generalization of thin-plate splines (TPS) and generally their approximation
power increases with $n$, but this is counteracted by the need to add a polynomial of
degree $n$, which causes increased ill-conditioning as $n$ increases. As for TPS, these
rbfs can be obtained from a least squares principle [117]. Although they have been
known for almost 30 years, they have yet to find widespread use in practice; most
analysts generally settle for the simpler TPS [116].

(v) Over the years, a persistent criticism of rbfs has been the fact that the well-known
rbfs have global support and so matrices associated with their approximation are
not sparse. For many years, the holy grail of rbf theory was to find a class of rbfs
with compact support and whose related interpolation matrices are invertible for
arbitrary data sets.


In the mid nineteen-nineties this problem was resolved by Wu [124] and improved
upon by Wendland in [119]. Without going into details, all of these functions
have the form

$$\phi(r) = \begin{cases} (1 - r)_+^{\,n} \, p(r), & 0 \le r \le 1, \\ 0, & r > 1, \end{cases} \tag{7.41}$$

where

$$(1 - r)_+ = \begin{cases} 1 - r, & 0 \le r \le 1, \\ 0, & r > 1, \end{cases} \tag{7.42}$$

and $p(r)$ is a polynomial whose degree depends on the dimension of the data space.

If one scales $r$ by $r \to r/a$, then we obtain rbfs supported on $0 \le r \le a$. From (7.41),
it follows that

$$\phi_a(r) = \phi(r/a) = 0 \quad \text{for } r > a. \tag{7.43}$$

As for Gaussians and multiquadrics, the scale parameter $a$ has a substantial effect on
the accuracy of the approximation by $\phi_a(r)$. As $a$ increases, the interpolation matrices
become less sparse, while their approximation properties improve. Finding the 'optimal'
value of $a$ to balance these competing effects is a difficult and not totally solved problem.
Some methods for doing this can be found in [43].
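The basis functions listed in this section are simple to evaluate. The sketch below codes one representative of each family as a function of the radius $r$. The function names are illustrative, and the particular compactly supported function, $(1 - s)_+^4 (4s + 1)$ with $s = r/a$, is one commonly quoted member of Wendland's family chosen here as an assumption; the text does not single out a specific polynomial $p(r)$.

```python
import math


def gaussian(r, c=1.0):
    """Gaussian rbf exp(-c r^2) with shape parameter c > 0."""
    return math.exp(-c * r * r)


def multiquadric(r, c=1.0):
    """Multiquadric sqrt(r^2 + c^2)."""
    return math.sqrt(r * r + c * c)


def inverse_multiquadric(r, c=1.0):
    """Inverse multiquadric (r^2 + c^2)^(-1/2)."""
    return 1.0 / math.sqrt(r * r + c * c)


def thin_plate_2d(r):
    """Thin-plate spline r^2 log r for planar data, using the limiting value 0 at r = 0."""
    return 0.0 if r == 0.0 else r * r * math.log(r)


def wendland_c2(r, a=1.0):
    """Compactly supported rbf (1 - s)_+^4 (4 s + 1), s = r / a; zero for r >= a."""
    s = r / a
    return (1.0 - s) ** 4 * (4.0 * s + 1.0) if s < 1.0 else 0.0


for r in (0.0, 0.5, 1.0, 2.0):
    print(r, gaussian(r), multiquadric(r), thin_plate_2d(r), wendland_c2(r, a=2.0))
```

Only the last function vanishes identically beyond a finite radius, which is what makes the associated matrices sparse when the support radius `a` is small relative to the spread of the data.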

We next turn to methods for determining the various parameters in (7.35) and (7.36). 


7.3.2 Fitting Methods for RBFs 


At present, there are three major methods for determining the unknown parameters
in rbf approximations: interpolation [96], smoothing interpolation [117] and linear and
nonlinear least squares [100]. Generally, one uses interpolation in the non-statistical
context (but not always, since the well-known statistical technique of kriging is known
to be equivalent to rbf interpolation). The latter two techniques may be viewed as
generalizations of interpolation. Hence, we begin with a brief discussion of interpolation
and, to simplify matters, we restrict ourselves to rbfs where the polynomial $p_m = 0$.
These include Gaussians, multiquadrics, Wendland's compactly supported rbfs and
rbfs such as the inverse multiquadrics $\phi(r) = (r^2 + c^2)^{-1/2}$.


Interpolation 


For interpolation we assume that $Q_k = P_k$, $1 \le k \le N$, and assume that we know
$y_k = f(P_k)$. (Sometimes $f$ is known, but usually in the statistical context it is not.) In
this case we equate the right hand side of (7.35) to $y_k$, $1 \le k \le N$. This gives

$$\sum_{j=1}^{N} \beta_j \, \phi(\|P_k - P_j\|) = y_k, \qquad 1 \le k \le N. \tag{7.44}$$

Letting $\beta = (\beta_1, \beta_2, \ldots, \beta_N)^T$,

$$\Phi = [\phi(\|P_k - P_j\|)], \qquad 1 \le j \le N, \ 1 \le k \le N, \tag{7.45}$$

and $y = (y_1, y_2, \ldots, y_N)^T$, then (7.44) can be written in matrix-vector form

$$\Phi \beta = y. \tag{7.46}$$

For Gaussians and the compactly supported rbfs, $\Phi$ is positive definite [119, 95], hence
invertible. For multiquadrics $\Phi$ is invertible, but not positive definite, while for inverse
multiquadrics $\Phi$ is positive definite. Hence,

$$\beta = \Phi^{-1} y \tag{7.47}$$

so that $\beta$ is uniquely determined.
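The interpolation conditions amount to solving one square linear system. The following is a minimal sketch for a Gaussian rbf, assuming five hypothetical centers in the plane with made-up values $y_k$; the names `rbf_interpolate` and `rbf_eval` are illustrative, and a small Gaussian elimination routine stands in for a library solver.

```python
import math


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def phi(r, c=1.0):
    """Gaussian rbf exp(-c r^2)."""
    return math.exp(-c * r * r)


def dist(p, q):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def rbf_interpolate(centers, y, c=1.0):
    """Build the matrix with entries phi(||P_k - P_j||) and solve Phi beta = y."""
    Phi = [[phi(dist(pk, pj), c) for pj in centers] for pk in centers]
    return solve(Phi, y)


def rbf_eval(point, centers, beta, c=1.0):
    """Evaluate f(P) = sum_j beta_j phi(||P - P_j||)."""
    return sum(b * phi(dist(point, pj), c) for b, pj in zip(beta, centers))


centers = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
y = [1.0, 2.0, 0.5, 3.0, 1.5]          # hypothetical values of f at the centers
beta = rbf_interpolate(centers, y)
fitted = [rbf_eval(p, centers, beta) for p in centers]
print([round(v, 6) for v in fitted])
```

The fitted values at the centers reproduce the data, which is exactly the interpolation property; between the centers the surface depends on the choice of the shape parameter `c`.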


Smoothed Interpolation 


When the data are noisy, interpolation may not be appropriate. In this case, rather
than choosing $\beta$ to exactly satisfy (7.44), we attempt to smooth the fit by preventing
the approximation from exactly passing through the data. Note first that interpolation
is equivalent to minimizing the residual sum of squares

$$L = (y - \Phi\beta, \ y - \Phi\beta) \tag{7.48}$$

with respect to $\beta$.

Adding a penalty term of the form $\lambda (\beta, \beta) = \lambda \|\beta\|^2$, $\lambda > 0$, to $L$, we now determine
$\beta$ by minimizing

$$L = (y - \Phi\beta, \ y - \Phi\beta) + \lambda \|\beta\|^2 \tag{7.49}$$

with respect to $\beta$. In this form this approach is equivalent to the ridge regression problem
discussed in Section 9.5. As shown there, the problem of choosing the ridge or smoothing
parameter $\lambda$ is a non-trivial problem and much of the theory over the past thirty years
has centered on that problem.

In Wahba’s work, she focuses on the use of cross-validation which is briefly discussed 
in Section 9.5. Further details can be found in [117]. 
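Setting the gradient of the penalized criterion to zero gives the linear system $(\Phi^T \Phi + \lambda I)\beta = \Phi^T y$, the usual ridge solution. The sketch below applies it to a one-dimensional Gaussian rbf with hypothetical noisy values; the function names and the value $\lambda = 0.5$ are assumptions made only for illustration, and no attempt is made here to choose $\lambda$ by cross-validation.

```python
import math


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def gaussian_matrix(points, centers, c=1.0):
    """Matrix with entries exp(-c (q - p)^2) for data sites q and centers p on the line."""
    return [[math.exp(-c * (q - p) ** 2) for p in centers] for q in points]


def smoothed_fit(Phi, y, lam):
    """Minimize (y - Phi b, y - Phi b) + lam ||b||^2 by solving (Phi^T Phi + lam I) b = Phi^T y."""
    n = len(Phi[0])
    A = [[sum(r[i] * r[j] for r in Phi) + (lam if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    rhs = [sum(r[i] * yi for r, yi in zip(Phi, y)) for i in range(n)]
    return solve(A, rhs)


def rss(Phi, y, beta):
    """Residual sum of squares (y - Phi beta, y - Phi beta)."""
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2 for r, yi in zip(Phi, y))


centers = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [0.1, 0.9, 0.2, 1.1, 0.0]          # hypothetical noisy observations
Phi = gaussian_matrix(centers, centers)

beta_interp = smoothed_fit(Phi, y, 0.0)   # lam = 0 reproduces interpolation
beta_smooth = smoothed_fit(Phi, y, 0.5)   # lam > 0 shrinks the coefficients
print(rss(Phi, y, beta_interp), rss(Phi, y, beta_smooth))
```

With $\lambda = 0$ the residual sum of squares is zero up to rounding, while a positive $\lambda$ trades some fit for a smaller coefficient norm.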


Least Squares Fitting 


In this approach we assume that the data points $\{Q_k\}_{k=1}^{M}$, $M \ge N$, are generally distinct
from the centers $\{P_j\}_{j=1}^{N}$. In this case the rbf model can be written in the form

$$y_k = \sum_{j=1}^{N} \beta_j \, \phi(\|P_j - Q_k\|) + \varepsilon_k, \qquad 1 \le k \le M \tag{7.50}$$

where $\{\varepsilon_k\}$ are independent $N(0, \sigma^2)$ random variables. Using our previous notation,
(7.50) can be written in standard regression form as

$$Y = X\beta + \varepsilon \tag{7.51}$$

where

$$X = [\phi(\|P_j - Q_k\|)], \qquad 1 \le j \le N, \ 1 \le k \le M. \tag{7.52}$$

Hence $\beta$ can be determined by minimizing the residual sum of squares

$$L = (y - X\beta, \ y - X\beta). \tag{7.53}$$


Here there are a number of distinct possibilities. If $\{P_j\}_{j=1}^{N}$ and the shape parameters
are also known, then minimizing $L$ is a standard regression problem and $\hat\beta$ is given by
(5.22a), i.e.,

$$\hat\beta = (X^T X)^{-1} X^T y \tag{7.54}$$

where $X = [\phi(\|P_j - Q_k\|)]$, $1 \le j \le N$, $1 \le k \le M$, and an unbiased estimate of $\sigma^2$ is
given by

$$\hat\sigma^2 = \frac{(y - X\hat\beta, \ y - X\hat\beta)}{M - N}. \tag{7.55}$$

One can then use all of the machinery of linear regression theory to evaluate the adequacy
of the model.
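When the centers and the shape parameter are fixed, the fit is an ordinary linear regression on the columns $\phi(\|P_j - Q_k\|)$. The sketch below assumes a one-dimensional Gaussian rbf with $N = 3$ hypothetical centers and $M = 9$ data sites, and uses noise-free responses so that the result can be checked by eye; all names and numbers are illustrative.

```python
import math


def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ols(X, y):
    """Least squares coefficients from the normal equations (X^T X) b = X^T y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


def design(points, centers, c=1.0):
    """M x N matrix with entries exp(-c (Q_k - P_j)^2)."""
    return [[math.exp(-c * (q - p) ** 2) for p in centers] for q in points]


centers = [0.0, 2.0, 4.0]                     # N = 3 known centers (hypothetical)
points = [0.5 * k for k in range(9)]          # M = 9 data sites 0, 0.5, ..., 4
X = design(points, centers)
true_beta = [1.0, -2.0, 0.5]
y = [sum(b * v for b, v in zip(true_beta, row)) for row in X]   # noise-free responses

beta_hat = ols(X, y)
resid = [yi - sum(b * v for b, v in zip(beta_hat, row)) for row, yi in zip(X, y)]
sigma2_hat = sum(e * e for e in resid) / (len(points) - len(centers))
print(beta_hat, sigma2_hat)
```

Because no noise was added, the variance estimate is zero up to rounding; with real data it would estimate $\sigma^2$ on $M - N$ degrees of freedom.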

In the second case, the centers $\{P_j\}_{j=1}^{N}$ are not known and must be determined from
the data. Generally, the shape parameters will be unknown as well. As before, we
minimize

$$L = (y - X\beta, \ y - X\beta) \tag{7.56}$$

with respect to $\beta$. However, the centers and shape parameters appear nonlinearly in $L$
and we minimize $L$ with respect to $\{P_j\}_{j=1}^{N}$ and the shape parameters $c$ simultaneously.
This is then a nonlinear regression problem and can be solved by solving the nonlinear
normal equations:

$$\partial L / \partial P_j = 0, \quad 1 \le j \le N, \qquad \partial L / \partial c = 0. \tag{7.57}$$

Unfortunately, these equations generally do not have an analytic solution as in the
linear case. In this case (7.57) has to be solved by some numerical iterative method,
such as the Gauss-Newton method. Further details can be found in [27].


7.4 Dummy Variables 


As we have already pointed out in Chapter 5, one of the important properties of the
GLM is its ability to incorporate both qualitative and quantitative variables in the
model. In many areas such as medicine and the social sciences, there are often many more
qualitative factors than quantitative ones, and the ability to account for the effects of
these factors is one of the great attractions of using the GLM. In addition to the effect
of gender discussed in Example 5.15, qualitative factors of all sorts occur in scientific
studies. In economics, seasonality factors are important, and sociological studies often
require one to account for geographical differences. In [64] the authors considered
various factors affecting the survival of patients following admission to a hospital
intensive care unit. Among the qualitative factors considered were: sex, race,
presence of cancer, history of renal failure, previous admission, pH from blood gases and
many others. Of 19 independent variables only three were quantitative.




In this section we expand on our work in Chapter 5 showing how to incorporate 
multiple qualitative factors, qualitative factors at more than two levels and interactions 
into the GLM. In addition, we give a brief comparison of the use of dummy variables and 
the more traditional analysis of variance. We begin by recapitulating some ideas from 
Chapter 5. 

Consider the following hypothetical salary data for eight professors at “good ole”
B.S.U.

Table 7.6 Hypothetical Salary Data

Rank                  No.   Salary (in thousands of $)
Associate Professor    1    22.5
                       2    33.5
                       3    25.0
                       4    27.0
Full Professor         5    39.0
                       6    40.0
                       7    37.0
                       8    36.0


We would like to determine whether there is a difference in the mean salaries of full 
and associate professors. The standard approach to such a problem is to use the t-test 
for testing the difference between two means. An alternative (but equivalent) approach 
which has significant generalizations is to make a linear model for the professors' salaries 
by using dummy variables and then utilizing a standard t-test for a regression coefficient. 

Now the variable of interest here is the rank of a professor, a qualitative variable. We
quantify this variable in the following way: Let

$$x = \begin{cases} 0, & \text{if the professor is an associate,} \\ 1, & \text{if the professor is full.} \end{cases}$$

If $Y$ denotes the salary of an arbitrary professor then (assuming that rank is the only
explanatory variable) we claim that

$$Y = \beta_0 + \beta_1 x + \varepsilon \tag{7.58}$$

where $\beta_0 = E(Y_{\text{associate}})$ and $\beta_0 + \beta_1 = E(Y_{\text{full}})$ so that

$$\beta_1 = E(Y_{\text{full}}) - E(Y_{\text{associate}}) \tag{7.59}$$

represents the difference in the mean salaries of full and associate professors.

To verify this, observe that if we take $x = 0$, then $E(Y) = \beta_0$, and if $x = 1$ then
$E(Y) = \beta_0 + \beta_1$. If the errors are normal with constant variance, then all inferences
concerning the professors' salaries can be made using the linear model. For example,
differences in the mean salaries can be checked by testing

$$H_0 : \beta_1 = 0 \quad \text{against} \quad H_1 : \beta_1 \ne 0. \tag{7.60}$$

This test can be shown to be equivalent to the usual t-test for the difference of two normal
means. The parameter estimates (as follows from (3.18) and (3.19)) have their expected
values; i.e.,

$$\hat\beta_0 = \frac{1}{4} \sum_{i=1}^{4} Y_{i(\text{associate})} = 27.0 \tag{7.61}$$

and

$$\hat\beta_0 + \hat\beta_1 = 38.0, \tag{7.62}$$

which is the average value of the full professors' salaries. We also find that

$$t_0 = 15.123; \qquad t_1 = 4.3566 \tag{7.63}$$

where $t_1$ has the same value as the classical t-statistic for testing the difference of two
normal means. From this it follows that the difference in mean salaries is significant at
the 5% level.
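The equivalence between the dummy variable regression and the classical two-sample t-test can be checked numerically. The sketch below uses the eight salaries as they are listed in Example 7.6, fits the one-variable model by the usual simple regression formulas, and compares $t_1$ with the pooled-variance t-statistic; the variable names are illustrative.

```python
import math

# Salaries in thousands of dollars, as listed in (7.81)
associate = [22.5, 33.5, 25.0, 27.0]
full = [39.0, 40.0, 37.0, 36.0]

# Dummy variable: 0 for an associate professor, 1 for a full professor
x = [0.0] * len(associate) + [1.0] * len(full)
y = associate + full
n = len(y)

# Simple linear regression of y on the dummy variable
xbar = sum(x) / n
ybar = sum(y) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

# Residual variance on n - 2 degrees of freedom and the two t-ratios
rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
s2 = rss / (n - 2)
t0 = b0 / math.sqrt(s2 * (1.0 / n + xbar ** 2 / sxx))
t1 = b1 / math.sqrt(s2 / sxx)

# Classical pooled-variance t-statistic for the difference of two means
m1 = sum(associate) / len(associate)
m2 = sum(full) / len(full)
ss1 = sum((v - m1) ** 2 for v in associate)
ss2 = sum((v - m2) ** 2 for v in full)
sp2 = (ss1 + ss2) / (len(associate) + len(full) - 2)
t_pooled = (m2 - m1) / math.sqrt(sp2 * (1.0 / len(associate) + 1.0 / len(full)))

print(b0, b1, round(t0, 3), round(t1, 4), round(t_pooled, 4))
```

The intercept equals the associate mean, the slope equals the difference of the two means, and the two t-statistics agree, matching (7.61) to (7.63).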

We now consider extending this model for the purpose of testing whether there is a
difference among the mean salaries for full, associate and assistant professors. Again we
try to make a salary model which depends only on the rank of a professor. We do this
by introducing two dummy variables as follows:

$$x_1 = \begin{cases} 1, & \text{if professor is an associate,} \\ 0, & \text{otherwise,} \end{cases} \qquad x_2 = \begin{cases} 1, & \text{if professor is an assistant,} \\ 0, & \text{otherwise.} \end{cases}$$

The salary model is now of the form

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \tag{7.64}$$

where

$$E(Y_{\text{full}}) = \beta_0, \qquad E(Y_{\text{associate}}) = \beta_0 + \beta_1, \qquad E(Y_{\text{assistant}}) = \beta_0 + \beta_2.$$

To see this, let $x_1 = x_2 = 0$; then $E(Y) = \beta_0$. But $x_1 = x_2 = 0$ if and only if a
professor is a full professor. Similarly, $x_1 = 1$ and $x_2 = 0$ indicates an associate
professor and then $E(Y) = \beta_0 + \beta_1$, while for an assistant professor $x_1 = 0$, $x_2 = 1$
and $E(Y) = \beta_0 + \beta_2$. Thus $\beta_1$ and $\beta_2$ measure the difference between full and associate
salaries and full and assistant salaries respectively. The test for overall differences in
salaries is

$$H_0 : \beta_1 = \beta_2 = 0 \tag{7.65}$$

against

$$H_1 : \beta_1 \ne 0 \ \text{or} \ \beta_2 \ne 0. \tag{7.66}$$

This may be done using the overall F test for the significance of the regression. This
test is equivalent to the classical one-way analysis of variance test for testing the difference
between three means. The model given in (7.64) is then a particular case of the one-way
ANOVA model.

The general case of a qualitative factor at $k$ levels may be treated by generalizing the
discussion given above. We do this by introducing $k - 1$ (0-1)-valued dummy variables
in the following way: Let

$$x_j = \begin{cases} 1, & \text{if the observation is in the } j\text{-th category,} \\ 0, & \text{otherwise,} \end{cases} \tag{7.67}$$

$j = 1, 2, \ldots, k - 1$. The model for $Y$ is then given by

$$Y = \beta_0 + \sum_{j=1}^{k-1} \beta_j x_j + \varepsilon, \tag{7.68}$$

where the expected value of observations in the $k$-th category is $\beta_0$ and those in the
$j$-th category, $1 \le j \le k - 1$, have expected values $\beta_0 + \beta_j$. (The $k$-th category is often
referred to as the excluded category.) To test whether there is a difference between the
means of the $k$ categories we test

$$H_0 : \beta_1 = \beta_2 = \cdots = \beta_{k-1} = 0 \tag{7.69}$$

against

$$H_1 : \text{at least one } \beta_j \ne 0, \quad 1 \le j \le k - 1. \tag{7.70}$$

This is nothing other than the F test for the overall significance of the regression.
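The coding rule for a factor at $k$ levels is mechanical and can be sketched in a few lines. The helper below is hypothetical (the name `dummy_code` and its convention that the last listed level is the excluded category are assumptions for illustration); it returns the $k - 1$ zero-one columns for each observation.

```python
def dummy_code(labels, levels):
    """Return k - 1 zero-one dummies per observation.

    `levels` lists the k categories; the last one is treated as the
    excluded category and therefore receives all zeros.
    """
    included = levels[:-1]
    return [[1 if lab == lev else 0 for lev in included] for lab in labels]


# Rank coded as in the salary model: x1 = associate, x2 = assistant, full excluded
ranks = ["full", "associate", "assistant", "associate"]
rows = dummy_code(ranks, ["associate", "assistant", "full"])
print(rows)
```

Each row gives $(x_1, x_2)$ for one professor; a column of ones for the intercept would be added when the design matrix is assembled.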


Example 7.3 We now consider an additional 4 assistant professors whose salaries
are listed in Table 7.7.

Table 7.7 Salary Data of Assistant Professors

Rank                  No.   Salary (in thousands of $)
Assistant Professor   10    23.0


If we introduce dummy variables as before, the data matrix for the linear model

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon \tag{7.71}$$

is shown in Table 7.8.

Table 7.8 Salary Data using Dummy Variables

Rank                  y      x0   x1   x2
Full Professor        39.0   1    0    0
                      40.0   1    0    0
                      37.0   1    0    0
                      36.0   1    0    0
Associate Professor   22.5   1    1    0
                      33.5   1    1    0
                      25.0   1    1    0
                      27.0   1    1    0
Assistant Professor   ...    1    0    1
                      23.0   1    0    1
                      30.0   1    0    1
                      ...    1    0    1

Fitting this model by least squares gives the results:

$\hat\beta_0 = 38.0$, $\hat\beta_1 = -11.0$, $\hat\beta_2 = -13.0$,
$t_0 = 20.710$, $t_1 = -4.238$, $t_2 = -5.105$,
$R^2 = 0.7683$, and $F = 14.925$.

From this we see that the overall regression is significant at the 1% level so we reject
the hypothesis of no differences between the mean salaries of professors of different ranks.

We also note (as may be checked) that $\hat\beta_0 = 38.0$ is the average salary of the 4 full
professors, $\hat\beta_0 + \hat\beta_1 = 27.0$ is the average salary of the 4 associate professors and
$\hat\beta_0 + \hat\beta_2 = 25.0$ is the average salary of the 4 assistant professors.


Example 7.4 As a further example of the use of dummy variables we consider the
results of an agricultural experiment. Such data are traditionally analyzed using ANOVA
methods.

Five varieties of peas were planted, each on 4 different plots. The yields in bushels
per acre are shown in Table 7.9:

Table 7.9 Pea Harvest Data

                 Variety
Yield     A      B      C      D      E
         26.2   29.2   29.1   21.3   20.1
         24.3   28.1   30.8   22.4   19.3
         21.8   27.3   33.9   24.3   19.9
         28.1   31.2   32.8   21.8   22.1

To test whether there are differences in the yields of the five varieties we consider a linear
model for the yield per acre as

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \varepsilon \tag{7.72}$$

where the $x_j$'s are dummy variables for varieties B-E with A being taken as the excluded
category.
To fit this model we use the data given in Table 7.10.

Table 7.10 Pea Data using Dummy Variables for Variety

Variety   y      x0   x1   x2   x3   x4
A         26.2   1    0    0    0    0
A         24.3   1    0    0    0    0
A         21.8   1    0    0    0    0
A         28.1   1    0    0    0    0
B         29.2   1    1    0    0    0
B         28.1   1    1    0    0    0
B         27.3   1    1    0    0    0
B         31.2   1    1    0    0    0
C         29.1   1    0    1    0    0
C         30.8   1    0    1    0    0
C         33.9   1    0    1    0    0
C         32.8   1    0    1    0    0
D         21.3   1    0    0    1    0
D         22.4   1    0    0    1    0
D         24.3   1    0    0    1    0
D         21.8   1    0    0    1    0
E         20.1   1    0    0    0    1
E         19.3   1    0    0    0    1
E         19.9   1    0    0    0    1
E         22.1   1    0    0    0    1

This model was fit by least squares with the following results:

$\hat\beta_0 = 25.1$, $\hat\beta_1 = 3.85$, $\hat\beta_2 = 6.55$, $\hat\beta_3 = -2.65$, $\hat\beta_4 = -4.75$,
$t_0 = 26.6$, $t_1 = 2.88$, $t_2 = 4.90$, $t_3 = -1.98$, $t_4 = -3.56$,
$R^2 = 0.8647$, and $F = 23.97$.

Here we can check that $\hat\beta_0$ is the mean value of the yields of variety A. Similarly,
$\hat\beta_0 + \hat\beta_1 = 28.95$ is the mean yield for B, $\hat\beta_0 + \hat\beta_2 = 31.65$ is the mean yield for C,
$\hat\beta_0 + \hat\beta_3 = 22.45$ is the mean yield for D while $\hat\beta_0 + \hat\beta_4 = 20.35$ is the mean yield for E.

Since there are 4 degrees of freedom for regression and 15 degrees of freedom for error,
we find that $f_{4,15,0.01} = 4.89$ so that we can reject $H_0$ at the 1% level and conclude that
differences in yields are highly significant.
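The pea example can be reproduced with a short least squares computation. The sketch below builds the dummy variable design with variety A as the excluded category and computes the coefficients, $R^2$ and the overall $F$ statistic. One assumption is flagged in the code: the third yield for variety C is taken as 33.9, the value consistent with the reported mean of 31.65 for that variety. The helper names are illustrative.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ols(X, y):
    """Least squares coefficients from the normal equations (X^T X) b = X^T y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


varieties = ["A", "B", "C", "D", "E"]
yields = {
    "A": [26.2, 24.3, 21.8, 28.1],
    "B": [29.2, 28.1, 27.3, 31.2],
    "C": [29.1, 30.8, 33.9, 32.8],   # assumption: 33.9, consistent with the mean 31.65
    "D": [21.3, 22.4, 24.3, 21.8],
    "E": [20.1, 19.3, 19.9, 22.1],
}

# Design matrix: intercept plus dummies for B, C, D, E (A is the excluded category)
X, y = [], []
for v in varieties:
    for value in yields[v]:
        X.append([1.0] + [1.0 if v == w else 0.0 for w in varieties[1:]])
        y.append(value)

beta = ols(X, y)
n, p = len(y), len(beta)
ybar = sum(y) / n
rss = sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2 for row, yi in zip(X, y))
tss = sum((yi - ybar) ** 2 for yi in y)
r2 = 1.0 - rss / tss
F = ((tss - rss) / (p - 1)) / (rss / (n - p))
print([round(b, 2) for b in beta], round(r2, 4), round(F, 2))
```

The intercept is the mean yield of variety A and each remaining coefficient is the difference between a variety mean and the mean of A, in agreement with the fitted values quoted in the example.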

Of course, we may consider models with two or more qualitative factors. If, for
example, we have two variables, one with $j$ levels and the other with $k$ levels, the resultant
model is usually called a two-way analysis of variance.

For $m$ qualitative variables, the model is an $m$-way analysis of variance, with $m = 2$
being perhaps the most common case. As an example, suppose we consider the effect of
gender on professors' salaries. It turns out that the sex distribution is:


Table 7.11 Rank and Gender of Professor 
Rank of Professor No. Sex 


Associate 


Assistant 


To account for this additional variable we introduce another dummy variable

$$x_3 = \begin{cases} 0, & \text{if male,} \\ 1, & \text{if female.} \end{cases}$$

The model is now of the form

$$Y = \beta_0 + \underbrace{\beta_1 x_1 + \beta_2 x_2}_{\text{rank}} + \underbrace{\beta_3 x_3}_{\text{sex}} + \varepsilon \tag{7.73}$$

where

$\beta_0$ = mean salary of a male full professor,
$\beta_0 + \beta_1$ = mean salary of a male associate professor,
$\beta_0 + \beta_2$ = mean salary of a male assistant professor,
$\beta_0 + \beta_3$ = mean salary of a female full professor,
$\beta_0 + \beta_1 + \beta_3$ = mean salary of a female associate professor,
$\beta_0 + \beta_2 + \beta_3$ = mean salary of a female assistant professor.

Thus, $(x_1, x_2)$ measures the effect of rank and $x_3$ measures the effect of sex. The
effect of sex can be determined by testing $H_0 : \beta_3 = 0$ against $H_1 : \beta_3 \ne 0$ using the
usual t test. To do this, we fitted the model (7.73) using the data in Table 7.12.


Table 7.12 Salary Data using Dummy Variables
Rank y x0 x1 x2 x3


Associate


Assistant


The results were:

$\hat\beta_0 = 38.88$, $\hat\beta_1 = -10.71$, $\hat\beta_2 = -13.8$, $\hat\beta_3 = -1.17$,
$t_0 = 12.74$, $t_1 = -3.77$, $t_2 = -4.39$, $t_3 = -0.37$,
$R^2 = 0.7722$, and $F = 9.04$.

From this we see that the overall regression is significant at the 1% level but the small
value of $t_3$ indicates that sex is not a significant predictor. (This is not surprising since
the example was made up by randomly assigning the sex of the professor by tossing a
coin.)

Of course, a model may contain quantitative as well as qualitative factors. Faculty
salaries may depend on age, number of publications, length of service and many other
factors as well. If these factors contribute linearly then they are added to the model, just
as the qualitative factors are. So for a linear model including a professor's age $x_4$ and
number of publications $x_5$, the model would be

$$Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 + \beta_4 x_4 + \beta_5 x_5 + \varepsilon. \tag{7.74}$$


Example 7.5 Let’s consider the jewelry sales data in Example 3.21. To adjust for
seasonal trends, one of the common methods is to take a moving average (over a whole
year) of the time series. Here we introduce dummy variables as a tool to adjust for the
seasonality in the data. Now consider the model with three dummy variables, $q_2$, $q_3$ and
$q_4$, as follows:

$$Y = \beta_0 + \beta_1 x_1 + (\beta_2 q_2 + \beta_3 q_3 + \beta_4 q_4) + \varepsilon \tag{7.75}$$

where $q_i$ ($i = 2, 3, 4$) indicates the $i$-th quarter effect in sales. In fact, $\beta_i$ measures the shift
from a first quarter base. Then, the model (7.75) becomes

(i) $Y = \beta_0 + \beta_1 x_1 + \varepsilon$; for the effect of the first quarter,
(ii) $Y = \beta_0 + \beta_1 x_1 + \beta_2 q_2 + \varepsilon$; for the effect of the second quarter,
(iii) $Y = \beta_0 + \beta_1 x_1 + \beta_3 q_3 + \varepsilon$; for the effect of the third quarter,
(iv) $Y = \beta_0 + \beta_1 x_1 + \beta_4 q_4 + \varepsilon$; for the effect of the fourth quarter.

The data for estimating $\beta$ is given in Table 7.13.

Table 7.13 Jewelry Sales Data using Dummy Variables

Year   Quarter (x1)   Sales Y (in $100,000)   q2   q3   q4
1957   1               36                     0    0    0
       2               44                     1    0    0
       3               45                     0    1    0
       4              106                     0    0    1
1958   1               38                     0    0    0
       2               46                     1    0    0
       3               47                     0    1    0
       4              112                     0    0    1
1959   1               42                     0    0    0
       2               49                     1    0    0
       3               48                     0    1    0
       4              118                     0    0    1
1960   1               42                     0    0    0
       2               50                     1    0    0
       3               51                     0    1    0
       4              118                     0    0    1

This model was fit by least squares with the following results:

$\hat\beta_0 = 34.95$, $\hat\beta_1 = 0.65$, $\hat\beta_2 = 7.10$, $\hat\beta_3 = 6.95$, $\hat\beta_4 = 72.05$,
$t_0 = 32.12$, $t_1 = 6.79$, $t_2 = 5.84$, $t_3 = 5.67$, $t_4 = 57.86$,
$R^2 = 0.998$, and $F = 1230.41$.


From the values of the t-statistics and the coefficient of determination in the results,
and the fact that $F = 1230.41 > f_{4,11,0.01}$, the overall regression is significant at the 1%
level. Hence, we conclude that the dummy variables adequately explain the seasonality
of the quarterly sales in the data. For example, we can see that the fourth quarter
dominates the seasonality in the sales because the average seasonal shift for the fourth
quarter is $\hat\beta_4 = 72.05$.

The advantage of using dummy variables is that both seasonal shifts and the
relationship of sales ($Y$) to time ($x_1$) are estimated simultaneously in the same regression
model. We also note that the slope $\hat\beta_1$ is now an unbiased estimate of $\beta_1$.
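The seasonal model can be refitted directly from the sixteen quarterly sales figures. In the sketch below two assumptions are made and marked in the code: the time variable $x_1$ is taken to be the running quarter index $1, 2, \ldots, 16$ (using the within-year quarter number would make $x_1$ an exact linear combination of the dummies), and the second-quarter value of the third year is taken as 49. With these choices the least squares estimates agree with the reported coefficients and $F$ statistic. Helper names are illustrative.

```python
def solve(A, b):
    """Solve the square system A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [list(row) + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(M[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def ols(X, y):
    """Least squares coefficients from the normal equations (X^T X) b = X^T y."""
    p = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
    return solve(XtX, Xty)


# Quarterly sales (in $100,000), four consecutive years; tenth value assumed to be 49
sales = [36, 44, 45, 106, 38, 46, 47, 112, 42, 49, 48, 118, 42, 50, 51, 118]

X = []
for t, _ in enumerate(sales, start=1):       # assumption: x1 = running quarter index
    quarter = (t - 1) % 4 + 1
    q2, q3, q4 = (1.0 if quarter == i else 0.0 for i in (2, 3, 4))
    X.append([1.0, float(t), q2, q3, q4])
y = [float(s) for s in sales]

beta = ols(X, y)
n, p = len(y), len(beta)
ybar = sum(y) / n
rss = sum((yi - sum(b * v for b, v in zip(beta, row))) ** 2 for row, yi in zip(X, y))
tss = sum((yi - ybar) ** 2 for yi in y)
F = ((tss - rss) / (p - 1)) / (rss / (n - p))
print([round(b, 2) for b in beta], round(1.0 - rss / tss, 3), round(F, 2))
```

The coefficients of $q_2$, $q_3$ and $q_4$ are the estimated shifts of each quarter relative to the first quarter after allowing for the common linear trend.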

In Figure 7.6, the residuals from the model (7.75) are now rather evenly scattered
about zero, where the two groups represent the first three quarters (low-sales season)
and the fourth quarter (high-sales season) respectively. The residual plot seems to show
that the errors are uncorrelated, which is probably accounted for by the adjustment for
seasonality in the model.

Figure 7.6: Residuals versus fitted values for jewelry sales data


7.4.1 Further Comments on Dummy Variables


Although we have discussed dummy variables using 0-1 coding, it is possible to use other
codings as well. For example, qualitative variables with two levels are often coded 1-2
rather than 0-1. Generally 0-1 coding is preferred because the regression coefficients are
usually easier to interpret and the effect of multicollinearity is reduced. For a factor at
three levels, 0-1 coding can be displayed by a 3 x 2 table

            x1    x2
level 1      1     0
level 2      0     1          (7.76)
level 3      0     0

Note that the columns are linearly independent vectors, so that arbitrary codings may be
used as long as the possible values are linearly independent. For example,

            x1    x2
level 1      1     0
level 2     -1     1          (7.77)
level 3      0     0

is permissible while

            x1    x2
level 1      1    -1
level 2      1    -1          (7.78)
level 3      0     0

is not.

This is easily extended to a factor at $k$ levels: the coding used must be given by
$k - 1$ linearly independent vectors. In addition, one must be careful not to use too many
dummies since that can inadvertently cause the model to have less than full rank. For
example, if we have a factor at 2 levels and use two dummies

            x1    x2
level 1      1     0          (7.79)
level 2      0     1

one can see that $x_1 + x_2 = 1$, so together with the constant term the columns of the
design matrix are linearly dependent. An example of this can be found in [27].

If one wishes to do ANOVA using a standard regression model, then one needs to
make sure that dummy variables are introduced which make the design matrix have full
rank. On the other hand, traditional ANOVA models use linear models with overspecified
variables resulting (by design) in design matrices which are less than full rank. In this
case the standard form of the GLM cannot be used and additional constraints need to be
introduced to allow the parameters to be estimated. This usually results in a constrained
least squares problem. From this point of view, traditional ANOVA is mathematically
more complicated than regression analysis.
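The rank conditions on the codings (7.76) to (7.79) can be verified numerically by appending the constant column and computing the rank of the resulting matrix, one row per level. The `rank` routine below is a small illustrative implementation by Gaussian elimination, not a library function.

```python
def rank(rows, tol=1e-10):
    """Rank of a matrix (list of rows) by Gaussian elimination with partial pivoting."""
    M = [[float(v) for v in r] for r in rows]
    nrows, ncols = len(M), len(M[0])
    rk = 0
    for c in range(ncols):
        if rk == nrows:
            break
        p = max(range(rk, nrows), key=lambda r: abs(M[r][c]))
        if abs(M[p][c]) < tol:
            continue                      # no pivot available in this column
        M[rk], M[p] = M[p], M[rk]
        for r in range(rk + 1, nrows):
            f = M[r][c] / M[rk][c]
            M[r] = [a - f * b for a, b in zip(M[r], M[rk])]
        rk += 1
    return rk


# Each row is (constant, x1, x2) for one level of the factor
coding_776 = [[1, 1, 0], [1, 0, 1], [1, 0, 0]]      # 0-1 coding, three levels
coding_777 = [[1, 1, 0], [1, -1, 1], [1, 0, 0]]     # permissible alternative coding
coding_778 = [[1, 1, -1], [1, 1, -1], [1, 0, 0]]    # x2 = -x1: not permissible
coding_779 = [[1, 1, 0], [1, 0, 1]]                 # two dummies for two levels

for name, coding in [("7.76", coding_776), ("7.77", coding_777),
                     ("7.78", coding_778), ("7.79", coding_779)]:
    print(name, "rank", rank(coding), "of", len(coding[0]), "columns")
```

A coding is usable when the rank equals the number of columns; (7.78) and (7.79) fall short of this by one, so the corresponding design matrices would not have full rank.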


Example 7.6 As an example of the above remarks we re-examine the professors'
salary data in Table 7.6. Let $Y_{jk}$ be the salary of the $k$-th professor in the $j$-th category.
Here, $j = 1$ for an associate and $j = 2$ for a full professor, and $1 \le k \le 4$. In the traditional
ANOVA approach we would use a model of the form

$$Y_{jk} = \mu + \alpha_j + \varepsilon_{jk}, \qquad j = 1, 2, \quad k = 1, 2, 3, 4. \tag{7.80}$$

From Table 7.6 we get

$$\begin{aligned}
22.5 &= Y_{11} = \mu + \alpha_1 + \varepsilon_{11}, \\
33.5 &= Y_{12} = \mu + \alpha_1 + \varepsilon_{12}, \\
25.0 &= Y_{13} = \mu + \alpha_1 + \varepsilon_{13}, \\
27.0 &= Y_{14} = \mu + \alpha_1 + \varepsilon_{14}, \\
39.0 &= Y_{21} = \mu + \alpha_2 + \varepsilon_{21}, \\
40.0 &= Y_{22} = \mu + \alpha_2 + \varepsilon_{22}, \\
37.0 &= Y_{23} = \mu + \alpha_2 + \varepsilon_{23}, \\
36.0 &= Y_{24} = \mu + \alpha_2 + \varepsilon_{24}.
\end{aligned} \tag{7.81}$$

Letting $y = (Y_{11}, Y_{12}, \ldots, Y_{24})^T$, $\beta = (\mu, \alpha_1, \alpha_2)^T$ and

$$X = \begin{pmatrix}
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 0 & 1 \\
1 & 0 & 1 \\
1 & 0 & 1
\end{pmatrix} \tag{7.82}$$

the data can be represented as a linear model

$$Y = X\beta + \varepsilon. \tag{7.83}$$

However, notice that $X$ has rank two because the sum of the last two columns equals
the first. Hence, $X^T X$ is not invertible so the coefficients $(\mu, \alpha_1, \alpha_2)$ cannot be obtained
as before. In effect, the usual ANOVA model is overparameterized, having three rather
than the two parameters necessary to represent the differences in salary. The regression
approach gives the correct number of variables for a nonsingular model.

To obtain an estimable model from (7.83) we need to impose a constraint on the
parameters. Typically one uses $\alpha_1 + \alpha_2 = 0$. In this case we can eliminate $\alpha_2$ from the
model, which can now be represented as

$$Y = X\beta + \varepsilon \tag{7.84}$$

where $\beta = (\mu, \alpha_1)^T$ and

$$X = \begin{pmatrix}
1 & 1 \\
1 & 1 \\
1 & 1 \\
1 & 1 \\
1 & -1 \\
1 & -1 \\
1 & -1 \\
1 & -1
\end{pmatrix} \tag{7.85}$$

which is a full rank model.
Further examples will be given in the Exercises.
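The constrained model of Example 7.6 has only two parameters, so its normal equations can be written out and solved by hand. The following sketch does this for the salary data in (7.81), assuming the constraint $\alpha_1 + \alpha_2 = 0$; the variable names are illustrative.

```python
# Salaries from (7.81): four associates (j = 1), then four full professors (j = 2)
y = [22.5, 33.5, 25.0, 27.0, 39.0, 40.0, 37.0, 36.0]

# Full rank design: columns for mu and alpha_1, using alpha_2 = -alpha_1
X = [[1.0, 1.0]] * 4 + [[1.0, -1.0]] * 4

# Entries of the 2 x 2 normal equations (X^T X) b = X^T y
a11 = sum(r[0] * r[0] for r in X)
a12 = sum(r[0] * r[1] for r in X)
a22 = sum(r[1] * r[1] for r in X)
b1 = sum(r[0] * yi for r, yi in zip(X, y))
b2 = sum(r[1] * yi for r, yi in zip(X, y))

det = a11 * a22 - a12 * a12          # nonzero, so X^T X is invertible
mu = (a22 * b1 - a12 * b2) / det
alpha1 = (a11 * b2 - a12 * b1) / det
alpha2 = -alpha1                     # recovered from the constraint

print(mu, alpha1, alpha2)
```

Here $\hat\mu$ is the overall mean salary, and $\hat\mu + \hat\alpha_1$ and $\hat\mu + \hat\alpha_2$ reproduce the associate and full professor means of 27.0 and 38.0 obtained from the dummy variable regression.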


7.5 Interactions 


As for quantitative variables, one can have interactions between qualitative and
quantitative variables. As we indicated in Example 5.5, an interaction between a qualitative
variable with two levels and a quantitative variable can be used to model the possibility
that the slope of the quantitative variable is different for each level of the qualitative
variable. Geometrically, this can be represented by a model representing two non-parallel
lines. For a qualitative factor at $k$ levels, an interaction with a single quantitative variable
implies that the model represents $k$ non-parallel lines.

For example, in modeling professor salaries, suppose salaries depend on age, but the
rate of change of salary (slope) depends on the professor's rank. Then, if $x_1, x_2$ are the
dummy variables used to represent rank and if $x_4$ denotes age, the interaction between
rank and age would be modeled by adding terms of the form

$$\gamma_1 x_1 x_4 + \gamma_2 x_2 x_4 \tag{7.86}$$

to the model (7.64). In general, an interaction (more specifically a two-way interaction)
between a quantitative variable $x_1$ and a qualitative variable with $k$ levels can be
represented by adding terms of the form

$$\gamma_1 x_1 x_2 + \gamma_2 x_1 x_3 + \cdots + \gamma_{k-1} x_1 x_k, \tag{7.87}$$

where $x_{j+1}$, $1 \le j \le k - 1$, are the dummy variables used to represent the levels of the
qualitative variable. That is, we enter into the model all possible products of the quantitative
and qualitative variables. If $x_1$ is a qualitative variable having two levels then (7.87) can
be coded to represent an interaction between it and the second factor having $k$ levels.
For example, in the model (7.73) an interaction between sex and rank would be represented
by terms of the form

$$\gamma_1 x_1 x_3 + \gamma_2 x_2 x_3. \tag{7.88}$$


More generally, an interaction between a qualitative factor at $k$ levels and another
at $j$ levels would be represented by adding all possible products of the dummies taken
two at a time. For example, an interaction between a qualitative factor at three levels
represented by two dummies $x_1$ and $x_2$ and one with four levels represented by $x_3$, $x_4$
and $x_5$ would be represented by adding a term of the form

$$\gamma_1 x_1 x_3 + \gamma_2 x_1 x_4 + \gamma_3 x_1 x_5 + \gamma_4 x_2 x_3 + \gamma_5 x_2 x_4 + \gamma_6 x_2 x_5 \tag{7.89}$$

to the model. In general there will be $(j - 1)(k - 1)$ new variables added to the model.

As one can see, adding interactions very quickly increases the number of independent
variables and hence the complexity of the model. To make matters worse, one may
wish to introduce 3-way, 4-way and higher order interactions into the model to cover all
possible interactions between all the variables. If one wishes to be very cautious about
the possible interactions between variables one would enter all of them to begin with and
test to see if any are significant. Unfortunately, this can cause substantial difficulties, in
both interpretation and computation.
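The growth in the number of columns is easy to see by forming the interaction terms explicitly. The sketch below builds all pairwise products of the dummies of two qualitative factors, in the same order as the terms of (7.89); the function name and the three example observations are hypothetical.

```python
def interaction_columns(d1, d2):
    """Row by row, all products of one dummy from the first factor with one from the second."""
    return [[a * b for a in r1 for b in r2] for r1, r2 in zip(d1, d2)]


# Factor with three levels (dummies x1, x2) and factor with four levels (dummies x3, x4, x5)
first = [[1, 0], [0, 1], [0, 0]]
second = [[0, 1, 0], [0, 0, 1], [1, 0, 0]]

products = interaction_columns(first, second)
print(products)                 # columns ordered x1x3, x1x4, x1x5, x2x3, x2x4, x2x5
print(len(products[0]))         # (3 - 1) * (4 - 1) new variables per observation
```

An observation contributes a nonzero interaction value only when it falls in a non-excluded level of both factors, which is why the third row here is all zeros.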

For even a small number of initial variables, including all interactions can easily
produce a model with more variables than observations, and hence the assumptions of
the GLM fail. Even if that does not occur, the introduction of interactions can cause
substantial multicollinearity in the model, similar to that in using high order polynomial
models for quantitative variables.

This can make the model difficult to interpret and may produce strange anomalies
where t or F tests show significant interaction effects but non-significant direct effects.
Moreover, some of the coefficients in an interaction could be significant while others
are not.

As a partial remedy for the multicollinearity one can use centered variables or partial
orthogonalization as indicated in [27, 87]. To illustrate some of those possibilities we
consider data given in [99].


Example 7.7 (Pulse Data [99]) In an experiment 92 students measured their pulse 
rate, then each student was asked to flip a coin. If the coin came up heads, then they 
were asked to run in place for one minute. Then everyone was asked to measure their 
pulse rates again. This second pulse rate was recorded and various other factors were 
recorded. They were: 


G (group: 1 — ran in place, 2 — did not run in place) 
K (smoker: 1 — smokes regularly, 2 — does not smoke regularly) 
S (sex: 1 — male, 2 — female) 


Letting $P_1$ = first pulse rate and $P_2$ = second pulse rate, it was desired to model $P_2$
as a function of $(P_1, G, K, S)$. For this a linear model with all possible interactions was
considered. The model was of the form

$$
\begin{aligned}
Y = {} & \beta_0 + \beta_1 P_1 + \beta_2 G + \beta_3 K + \beta_4 S + \beta_5 GS + \beta_6 GK + \beta_7 SK \\
& + \beta_8 GSK + \beta_9 P_1 G + \beta_{10} P_1 K + \beta_{11} P_1 S + \beta_{12} P_1 GS \\
& + \beta_{13} P_1 GK + \beta_{14} P_1 SK + \beta_{15} P_1 GSK + \varepsilon.
\end{aligned}
\tag{7.90}
$$

Under the assumptions of the GLM the model was fit and the results are shown in Table
7.14, which includes the values of the regression coefficients, their standard errors (S.E.),
t-ratios, p-values, variance inflation factors (VIF), and sequential sums of squares (Seq
SS).
Table 7.14 Full Model for Pulse Data

Predictor   Coefficient   S.E.     t-statistic   p-value   VIF        Seq SS
constant    149.9         228.5     0.66         0.514     -          -
P1          -1.577        2.976    -0.53         0.598     1629.6     10096.1
G           -51.6         152.3    -0.34         0.735     8389.8     7908.0
K           -91.0         138.3    -0.66         0.513     6219.3     116.7
S           5.3           168.6     0.03         0.975     102786     1087.0
GS          -15.5         124.3    -0.12         0.901     29317.3    2129.0
GK          29.74         88.41     0.34         0.738     16003.6    295.6
SK          30.99         98.14     0.32         0.753     18549.5    62.2
GSK         -4.67         68.55    -0.07         0.946     40712.3    11.0
P1G         1.032         1.926     0.54         0.594     8712.7     122.4
P1K         1.402         1.846     0.76         0.450     6878.1     51.9
P1S         0.504         2.085     0.24         0.810     12649.2    61.8
P1GS        -0.121        1.494    -0.08         0.936     27402.2    12.6
P1GK        -0.523        1.150    -0.45         0.651     15125.2    49.6
P1SK        -0.399        1.242    -0.32         0.749     18958.3    30.1
P1GSK       0.0772        0.8429    0.09         0.927     35106.2    0.5

Table 7.15 ANOVA for Pulse Data

Source       df   Sum of Squares   Mean Squares   F
Regression   15   22034.5          1469.0         24.51
Residual     76   4555.5           59.9           -
Total        91   26590.0          -              -

R² = 0.829        R² (adjusted) = 0.795


Table 7.16 Full Model using 0-1 Dummy Variables

Predictor   Coefficient   S.E.     t-statistic   p-value   VIF      Seq SS
constant    1.25          13.14     0.10         0.925     -        -
P1          0.9716        0.1839    5.28         0.000     6.2      10096.1
G           16.99         22.91     0.74         0.461     189.9    7908.0
K           9.88          18.13     0.54         0.587     106.8    116.7
S           17.65         19.06     0.93         0.357     131.5    1087.0
GS          24.83         33.40     0.74         0.460     180.2    2129.0
GK          25.06         32.23     0.78         0.439     180.8    295.6
SK          -21.65        55.27    -0.39         0.696     372.3    62.2
GSK         -4.67         68.55    -0.07         0.946     300.0    11.0
P1G         -0.0196       0.3276   -0.06         0.952     218.2    122.4
P1K         -0.1115       0.2543   -0.44         0.662     123.5    51.9
P1S         -0.2264       0.2635   -0.86         0.393     153.7    61.8
P1GS        -0.0335       0.4512   -0.07         0.941     221.4    12.6
P1GK        -0.4458       0.4515   -0.99         0.327     211.5    49.6
P1SK        0.2441        0.6582    0.37         0.712     384.3    30.1
P1GSK       0.0772        0.8429    0.09         0.927     314.1    0.5
As one can see in Table 7.15, the F-statistic indicates that the variables taken as a
whole are significant, but the t values show that none of the variables individually is
significant. As we noted in Chapter 5, this behavior is indicative of multicollinearity, and
that is further substantiated by the high variance inflation factors. At this point it is
difficult to interpret what the model is doing.

As a first step towards mitigating the multicollinearity, the qualitative variables were
recoded from 1-2 to 0-1. The results are displayed in Table 7.16. The ANOVA is the
same as Table 7.15 because a change in the scale does not affect it, and now at least $P_1$
is significant. Moreover, the VIFs are reduced by two orders of magnitude. Again it is
difficult to interpret the fit. To further reduce multicollinearity all variables $(P_1, G, K, S)$
were centered,

$$
P_1' = P_1 - \bar{P}_1, \quad G' = G - \bar{G}, \quad S' = S - \bar{S}, \quad K' = K - \bar{K},
\tag{7.91}
$$

and the interactions were written in terms of the centered variables. Again the model was fit
and the results are shown in Table 7.17. The effects are dramatic. The variance inflation
factors (VIFs) are $\le 3$ (recall VIF = 1 for orthogonal variables) and now $P_1$, G, S and
GS are significant. This suggests that the reduced model

$$
Y = \beta_0 + \beta_1 P_1 + \beta_2 G + \beta_3 S + \beta_4 GS + \varepsilon
\tag{7.92}
$$

might be appropriate. A further fit using orthogonalized variables gave results shown in
Table 7.18.
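The effect of centering on the VIFs can be illustrated with a small sketch. The data below
are synthetic stand-ins (not the pulse data of [99]): a covariate `p1` with a mean far from
zero and a factor `g` coded 1-2, plus their product. The helper `vif` is illustrative and uses
the standard identity that the VIFs are the diagonal entries of the inverse correlation
matrix of the predictors.

```python
import numpy as np


def vif(X):
    """VIF of each column of X (no intercept column), from the inverse correlation matrix."""
    corr = np.corrcoef(X, rowvar=False)
    return np.diag(np.linalg.inv(corr))


rng = np.random.default_rng(0)
n = 92
p1 = rng.normal(73.0, 11.0, n)              # synthetic resting pulse rates
g = rng.integers(1, 3, n).astype(float)     # synthetic group indicator coded 1-2

# Raw coding: the product p1*g is nearly a linear function of p1 and g.
raw = np.column_stack([p1, g, p1 * g])

# Centered coding: subtract the means before forming the product.
p1c, gc = p1 - p1.mean(), g - g.mean()
centered = np.column_stack([p1c, gc, p1c * gc])

vif_raw = vif(raw)
vif_centered = vif(centered)
print("raw VIFs:     ", np.round(vif_raw, 1))
print("centered VIFs:", np.round(vif_centered, 2))
```

In this sketch the raw product term inherits most of its variation from the direct effects,
so its VIF is large, whereas the centered product is nearly uncorrelated with them.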


Table 7.17 Full Model using Centered Variables

Predictor   Coefficient   S.E.      t-statistic   p-value   VIF   Seq SS
constant    80.919        1.057      76.53        0.000     -     -
P1          0.81929       0.09264     8.84        0.000     1.6   10096.1
G           -21.935       2.085     -10.52        0.000     1.6   7908.0
K           2.400         2.701       0.89        0.377     2.4   116.7
S           8.604         2.400       3.58        0.001     2.1   1087.0
GS          -22.677       4.664      -4.86        0.000     1.8   2129.0
GK          -7.055        4.998      -1.41        0.162     2.0   295.6
SK          3.492         6.497       0.54        0.592     3.0   62.2
GSK         0.95          11.77       0.08        0.936     2.4   11.0
P1G         0.1590        0.1887      0.84        0.402     1.6   122.4
P1K         0.1771        0.2016      0.88        0.382     2.0   51.9
P1S         -0.1559       0.1927     -0.81        0.421     1.6   61.8
P1GS        0.0100        0.3814      0.03        0.979     1.7   12.6
P1GK        -0.4164       0.3893     -1.07        0.288     1.8   49.6
P1SK        -0.2735       0.4543     -0.60        0.549     2.5   30.1
P1GSK       0.0772        0.8429      0.09        0.927     2.2   0.5
Table 7.18 Full Model using Orthogonal Variables

Predictor   Coefficient   S.E.      t-statistic   p-value   VIF   Seq SS
constant    43.134        7.516       5.74        0.000     -     -
P1          0.76401       0.08419     9.08        0.000     1.3   10096.1
G           -20.846       1.732     -12.04        0.000     1.1   7908.0
K           2.015         1.902       1.06        0.293     1.2   116.7
S           8.358         1.853       4.51        0.000     1.2   1087.0
GS          -22.819       4.108      -5.55        0.000     1.4   2129.0
GK          -7.308        3.797      -1.92        0.058     1.1   295.6
SK          2.208         4.852       0.46        0.650     1.6   62.2
GSK         1.542         9.862       0.16        0.876     1.6   11.0
P1G         0.2262        0.1794      1.26        0.211     1.4   122.4
P1K         0.1848        0.1777      1.04        0.302     1.4   51.9
P1S         -0.1366       0.1826     -0.75        0.457     1.4   61.8
P1GS        0.0113        0.3811      0.03        0.976     1.4   12.6
P1GK        -0.4236       0.3813     -1.11        0.270     1.6   49.6
P1SK        -0.2912       0.4111     -0.71        0.481     1.4   30.1
P1GSK       0.0772        0.8429      0.09        0.927     1.0   0.5


Now the VIFs are all < 2 and five variables are significant. Moreover, GK is now
marginally significant with a t value of -1.92. However, K itself does not seem to
be significant, indicating the type of anomaly suggested previously. Since logically one
would expect the direct effect of a variable to be important if its interaction with another
variable is, a general rule is to include all direct effects when their interactions are
significant.

As a consequence, a reasonable model would include the variables $P_1$, G, K, S,
GS and GK as predictors. To test this, the model was refit using these variables.


Table 7.19 Model using P1, G, K, S, GS and GK

Predictor   Coefficient   S.E.      t-statistic   p-value   VIF   Seq SS
constant    80.6367       0.8024    100.49        0.000     -     -
P1          0.76297       0.07786     9.80        0.000     1.1   10096.1
G           -21.012       1.663     -12.64        0.000     1.0   7908.0
K           8.891         1.769       5.03        0.000     1.2   1087.0
S           1.885         1.786       1.06        0.294     1.1   116.7
GS          -8.030        3.567      -2.25        0.027     1.0   405.6
GK          -20.649       3.510      -5.88        0.000     1.0   2019.0


Now one sees that all variables with the exception of S are significant. However, since
GS is significant, S should be included for logical completeness. It appears that we can
settle for the six-variable model as the best choice.


7.6 Logistic Regression Revisited 


As indicated in Section 6.6, regression methods can be used to model data with binary
0-1 responses as well as 0-1 predictors. Typically, when one has binary responses, binary
predictors occur as well, and such models play an important role in interpreting data from
medical, social science and engineering experiments. Since models with binary responses
do not have normal errors, one needs to be cautious about using normal theory regression
methods without careful examination of the assumptions.

In Section 6.5 we assumed that one might model the frequency of a positive drug
response $\pi(x)$ as a linear function of the predictor x. However, $\pi(x)$ is a probability so
it must satisfy $0 \le \pi(x) \le 1$. In general, a linear function $\beta_0 + \sum_{j=1}^{m} \beta_j x_j$ will not satisfy
this condition, so such a model seems to be mathematically inconsistent. To model such
data, it appears reasonable to assume that, with $\langle \beta, x \rangle = \beta_0 + \sum_{j=1}^{m} \beta_j x_j$,

$$
\pi(x) = \Phi(\langle \beta, x \rangle)
\tag{7.93}
$$

for some function $\Phi$ such that $0 \le \Phi(y) \le 1$, $-\infty < y < \infty$. Examination of experimental
data often shows that $\pi(x)$ is an S-shaped curve as shown in Figure 7.7.


Figure 7.7: Examples of logistic response functions, E(Y) plotted against x:
(a) monotonic increasing, (b) monotonic decreasing


The S-shape of $\Phi$ is typical of a cumulative distribution function (cdf) F(x) of a
random variable X. Typical $\Phi$'s are the cdf of a standard normal random variable,
referred to as the probit function, or the cdf of a logistic random variable. The latter,

$$
\Phi(x) = \frac{e^{x}}{1 + e^{x}} = \left(1 + e^{-x}\right)^{-1}, \quad -\infty < x < \infty,
\tag{7.94}
$$

seems to be the most common choice, particularly in the medical and social sciences.
Other choices are cdfs of extreme value and Weibull random variables, which are often
used in reliability theory. Here we focus on the logistic function, since it is widely used
and allows a useful probabilistic interpretation of the coefficients in (7.93).

From (7.94) it follows, by solving

$$
\frac{\exp(\langle \beta, x \rangle)}{1 + \exp(\langle \beta, x \rangle)} = \pi(x)
\tag{7.95}
$$

for $\langle \beta, x \rangle$ in terms of $\pi(x)$, that

$$
\log\left[\frac{\pi(x)}{1 - \pi(x)}\right] = \langle \beta, x \rangle
\tag{7.96}
$$

and this suggests that if $Y_i$ is the relative frequency $r_i/n_i$ of successes when $x = x_i$,
a suitable model for binary data is of the form

$$
\log\left(\frac{Y_i}{1 - Y_i}\right) = \langle \beta, x_i \rangle + \varepsilon_i, \quad 1 \le i \le n.
\tag{7.97}
$$

That is, we regress

$$
\operatorname{logit}(Y_i) = \log\left(\frac{Y_i}{1 - Y_i}\right) = Z_i
\tag{7.98}
$$

on $x_i$. To determine an appropriate estimation method we need to examine the structure
of the errors $\varepsilon_i$, $1 \le i \le n$, hence of $Z_i$.

Now if the observations are independent, then $r_i$, the number of successes in $n_i$
observations, has a binomial distribution with $E(r_i) = n_i \pi(x_i)$ and
$\mathrm{Var}(r_i) = n_i \pi(x_i)[1 - \pi(x_i)]$. Hence,

$$
E(Y_i) = E(r_i/n_i) = \pi(x_i)
\tag{7.99}
$$

and

$$
\mathrm{Var}(Y_i) = \mathrm{Var}(r_i/n_i) = \frac{1}{n_i}\, \pi(x_i)[1 - \pi(x_i)].
\tag{7.100}
$$

When $n_i$ is large, it follows from the Central Limit Theorem that $r_i$ is approximately
$N(n_i \pi(x_i),\; n_i \pi(x_i)[1 - \pi(x_i)])$ and $Y_i$ is approximately
$N(\pi(x_i),\; \pi(x_i)[1 - \pi(x_i)]/n_i)$.

Using this we can determine the approximate distribution of logit($Y_i$) for large $n_i$.
Letting $f(Y_i) = \log[Y_i/(1 - Y_i)]$ it follows from Taylor's theorem that

$$
f(Y_i) \approx f(\mu_i) + (Y_i - \mu_i) f'(\mu_i),
\tag{7.101}
$$

where $\mu_i = E(Y_i) = \pi(x_i)$. But $f(Y) = \log(Y) - \log(1 - Y)$ so that
$f'(Y) = 1/Y + 1/(1 - Y)$ and then

$$
f(Y_i) \approx \log\left[\frac{\pi(x_i)}{1 - \pi(x_i)}\right]
+ [Y_i - \pi(x_i)]\left[\frac{1}{\pi(x_i)} + \frac{1}{1 - \pi(x_i)}\right]
= \operatorname{logit}[\pi(x_i)] + \frac{Y_i - \pi(x_i)}{\pi(x_i)[1 - \pi(x_i)]}.
\tag{7.102}
$$

Since $Y_i$ is approximately $N(\pi(x_i),\; \pi(x_i)[1 - \pi(x_i)]/n_i)$, $f(Y_i)$ is approximately a
linear transform of a normal random variable, so it is approximately normal with

$$
E[f(Y_i)] = \log\left[\frac{\pi(x_i)}{1 - \pi(x_i)}\right]
\tag{7.103}
$$

and

$$
\mathrm{Var}[f(Y_i)] = \frac{\mathrm{Var}[Y_i - \pi(x_i)]}{\{\pi(x_i)[1 - \pi(x_i)]\}^2}
= \frac{1}{\{\pi(x_i)[1 - \pi(x_i)]\}^2} \cdot \frac{\pi(x_i)[1 - \pi(x_i)]}{n_i}
= \frac{1}{n_i \pi(x_i)[1 - \pi(x_i)]}.
\tag{7.104}
$$

Hence, logit($Y_i$) is approximately $N(\operatorname{logit}[\pi(x_i)],\; \{n_i \pi(x_i)[1 - \pi(x_i)]\}^{-1})$. From this
it follows that the errors $\varepsilon_i$ in (7.97) are approximately $N(0,\; \{n_i \pi(x_i)[1 - \pi(x_i)]\}^{-1})$,
so our discussion in Section 6.5 suggests that $\beta$ can be estimated using weighted least
squares with

$$
W = \mathrm{diag}\left(n_i \pi(x_i)[1 - \pi(x_i)]\right).
\tag{7.105}
$$
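The variance approximation behind these weights can be checked numerically. The sketch below
is a simulation under assumed values (group size 200, success probability 0.3, both chosen
only for illustration); it compares the simulated variance of logit($Y_i$) with the
first-order approximation $1/\{n_i \pi(x_i)[1 - \pi(x_i)]\}$.

```python
import numpy as np

rng = np.random.default_rng(1)

n_i, pi_i = 200, 0.3                      # illustrative group size and success probability
r = rng.binomial(n_i, pi_i, size=200_000) # simulated success counts r_i
y = r / n_i                               # relative frequencies Y_i
logit_y = np.log(y / (1.0 - y))

approx_var = 1.0 / (n_i * pi_i * (1.0 - pi_i))   # first-order (Taylor) approximation
sim_var = logit_y.var()
sim_mean = logit_y.mean()
target_mean = np.log(pi_i / (1.0 - pi_i))

print(f"approximate variance: {approx_var:.5f}, simulated variance: {sim_var:.5f}")
print(f"logit(pi): {target_mean:.4f}, simulated mean: {sim_mean:.4f}")
```

For groups this large the two variances should agree to within a few percent; the agreement
deteriorates as $n_i$ becomes small or $\pi(x_i)$ approaches 0 or 1.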


Since the $\pi(x_i)$ are unknown, our argument in Section 6.5 suggests that a simple approach would
be to estimate $\pi(x_i)$ by $r_i/n_i = f_i$, the observed relative frequency of successes for each
covariate combination $x_i$. The coefficients $\beta$ can then be estimated using weighted least
squares with weights $w_i = n_i f_i (1 - f_i)$. Further accuracy can generally be achieved
using iteratively reweighted least squares, now using weights $n_i \hat{\pi}(x_i)[1 - \hat{\pi}(x_i)]$ where

$$
\hat{\pi}(x_i) = \frac{\exp(x_i^T \hat{\beta})}{1 + \exp(x_i^T \hat{\beta})}
$$

and $\hat{\beta}$ is the WLS estimator of $\beta$. This can be
continued until the estimates of $\beta$ converge. Often, the initial WLS estimate is sufficient.

To assess the model (goodness of fit, significance of coefficients, etc.) one can
proceed as indicated in Section 6.5, using the transformed model to obtain t and F tests,
residual examination, leverage and influence diagnostics. For further details we refer the
reader to [27, 87]. As an example of this approach for analyzing binary response data we
consider the following model discussed briefly in [93].


Example 7.8 An experiment was performed to determine customer response to
coupons for various levels of price reductions. For this, coupons giving 5%, 10%, 15%,
20% and 30% price reductions were distributed to 200 families ($= n_i$) in each category.
To assess the response the number of coupons redeemed was recorded and the results
are shown in Table 7.20. Letting $r_i$ = number of coupons redeemed, $f_i = r_i/n_i$ gives the
relative frequency of coupon redemption for each category. The results are plotted in
Figure 7.8 and suggest that a logistic model would be appropriate to model
the probability of redemption $\pi(x_i)$ for a price reduction of $x_i$%.


Table 7.20 Coupon Redemption Data

Category i   x_i (in %)   n_i   r_i   f_i
1            5            200   32    0.160
2            10           200   51    0.255
3            15           200   70    0.350
4            20           200   103   0.515
5            30           200   148   0.740


The scatter plot of $f_i$ versus $x_i$ suggests that a linear model

$$
\operatorname{logit}[\pi(x)] = \beta_0 + \beta_1 x + \varepsilon
\tag{7.106}
$$

is an appropriate model for the data. The model was fit using a single stage
WLS estimation with weights as given in Table 7.21.


Table 7.21 Weights for Coupon Redemption Data

Category i   x_i (in %)   n_i   r_i   f_i     w_i = n_i f_i (1 - f_i)
1            5            200   32    0.160   26.88
2            10           200   51    0.255   37.995
3            15           200   70    0.350   45.5
4            20           200   103   0.515   49.955
5            30           200   148   0.740   38.48
Figure 7.8: Plot of $f_i$ and fitted logistic response function


The results of the fit are shown in Table 7.22 with the fitted equation being

$$
\operatorname{logit}[\hat{\pi}(x)] = -2.19 + 0.109x.
\tag{7.107}
$$


Table 7.22 Analysis of Parameter Estimates

Predictor   Coefficient   S.E. Coeff.   t-statistic   p-value
constant    -2.18506      0.06783       -32.21        0.000
Price       0.10870       0.00363        29.93        0.000


Table 7.23 ANOVA Table for Coupon Redemption Data

Source       df   Sum of Squares   Mean Squares   F
Regression   1    151.98           151.98         895.53
Residual     3    0.51             0.17
Total        4    152.49

The t values show that both $\beta_0$ and $\beta_1$ are highly significant. Using (7.107) the
estimated values $\hat{\pi}(x)$ are shown in Table 7.24 and a plot of the residuals is given in Figure
7.9. Overall, it appears that the model (7.107) gives an excellent representation of the
observed data.
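The single stage WLS fit of this example can be reproduced with a few lines of linear algebra.
The following is a minimal sketch using NumPy (the variable names are illustrative); it
forms the empirical logits and the weights of Table 7.21 and solves the weighted normal
equations $X^T W X \beta = X^T W z$.

```python
import numpy as np

# Coupon redemption data: price reduction (%), families per category, coupons redeemed.
x = np.array([5.0, 10.0, 15.0, 20.0, 30.0])
n = np.full(5, 200.0)
r = np.array([32.0, 51.0, 70.0, 103.0, 148.0])

f = r / n                       # observed relative frequencies f_i
z = np.log(f / (1.0 - f))       # empirical logits, the transformed response
w = n * f * (1.0 - f)           # weights w_i = n_i f_i (1 - f_i)

X = np.column_stack([np.ones_like(x), x])   # design matrix with intercept
W = np.diag(w)

# Weighted least squares: solve (X' W X) beta = X' W z.
beta_wls = np.linalg.solve(X.T @ W @ X, X.T @ W @ z)

# Fitted redemption probabilities from the fitted logits.
pi_hat = 1.0 / (1.0 + np.exp(-(X @ beta_wls)))

print("weights:", w)
print("intercept, slope:", np.round(beta_wls, 5))
print("fitted probabilities:", np.round(pi_hat, 3))
```

The estimates obtained this way should agree with the coefficients reported in Table 7.22
up to rounding. Replacing `f` by `pi_hat` in the weights and refitting gives one step of the
iteratively reweighted scheme described earlier.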


Table 7.24 $\hat{\pi}(x)$ for Coupon Redemption Data

Category i   x_i (in %)   n_i   r_i   f_i     logit[π̂(x)]
1            5            200   32    0.160   -1.645
2            10           200   51    0.255   -1.10
3            15           200   70    0.350   -0.555
4            20           200   103   0.515   -0.01
5            30           200   148   0.740   1.08


7.6.1 Interpretation of Logistic Coefficients 


One of the attractive features of the logistic model is the probabilistic interpretation of
the regression coefficients that is possible, particularly when the independent variable
is categorical with two values 0-1. As we have already discussed, these are useful in
describing treatment effects or the presence or absence of a factor, such as sex, smoking,
etc.

Figure 7.9: Scatter plot of residuals versus logit(Y)

From (7.106) the effect of a change of one unit in the independent variable is given by

$$
\beta_i = \operatorname{logit}[\pi(x_i = 1)] - \operatorname{logit}[\pi(x_i = 0)]
= \log\left\{\frac{\pi(x_i = 1)/[1 - \pi(x_i = 1)]}{\pi(x_i = 0)/[1 - \pi(x_i = 0)]}\right\}.
\tag{7.108}
$$

Now if $\pi$ represents the probability of an event, then $\pi/(1 - \pi)$ is the odds of the
event occurring against its not occurring. Hence, logit($\pi$) is often referred to as the log
odds of the event. Using this terminology, the quantity

$$
\psi_i = \frac{\pi(x_i = 1)/[1 - \pi(x_i = 1)]}{\pi(x_i = 0)/[1 - \pi(x_i = 0)]}
\tag{7.109}
$$

is called the odds ratio and (7.108) is then the log odds ratio. If we now exponentiate
(7.108) we get $\exp(\beta_i) = \psi_i$. For example, if $x_i = 1$ for a treatment and $x_i = 0$ otherwise,
then $\psi_i$ represents the odds ratio for the success of the treatment against its failure. In
many situations where the absolute risk $\pi$ of an event is small, such as getting cancer
from an air pollutant, $1 - \pi \approx 1$ and the odds ratio becomes

$$
\psi_i \approx \frac{\pi(x_i = 1)}{\pi(x_i = 0)},
\tag{7.110}
$$

where $\pi(x_i = 1)/\pi(x_i = 0)$ is called the relative risk of the event of getting cancer due
to the presence of a pollutant ($x_i = 1$) against its absence ($x_i = 0$). So if $\psi = 2$, then we
would interpret this as saying one is twice as likely to get cancer when exposed to the
pollutant as when not exposed. It is the possibility of making such interpretations
which contributes to the use of the logistic model. Such quantities are reported almost
daily in the popular media.
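The relation between the odds ratio and the relative risk can be illustrated numerically. In
the sketch below the probabilities are hypothetical values chosen for illustration, and the
helper names are not from the text; the last line exponentiates the price coefficient 0.10870
of Table 7.22 to give the multiplicative change in the odds of redemption per additional
percentage point of price reduction.

```python
import math


def odds(p):
    """Odds of an event with probability p."""
    return p / (1.0 - p)


def odds_ratio(p1, p0):
    """Odds ratio of the event with the factor present (p1) versus absent (p0)."""
    return odds(p1) / odds(p0)


def relative_risk(p1, p0):
    """Relative risk of the event with the factor present versus absent."""
    return p1 / p0


# Rare event (hypothetical risks): odds ratio and relative risk nearly coincide.
rare_or = odds_ratio(0.002, 0.001)
rare_rr = relative_risk(0.002, 0.001)

# Common event (hypothetical risks): the odds ratio overstates the relative risk.
common_or = odds_ratio(0.6, 0.3)
common_rr = relative_risk(0.6, 0.3)

# Odds multiplier per unit change implied by a logistic coefficient.
odds_multiplier = math.exp(0.10870)

print(rare_or, rare_rr)
print(common_or, common_rr)
print(odds_multiplier)
```

The comparison shows why the "twice as likely" reading of an odds ratio of 2 is only
justified when the underlying risks are small.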
7.6.2 Maximum Likelihood Estimation


When the sample sizes $n_i$ in (7.98) are not large (in practice they may be zero), the
asymptotic approach to estimation of the parameters in a logistic model may not be
appropriate. In this case an estimation method which does not require large sample sizes
should be considered. In keeping with our treatment of the GLM this can be done using
maximum likelihood estimation with grouped or ungrouped data. We will consider the
case of ungrouped data, since weighted least squares cannot be used directly in this case
(logit(Y), Y = 0, 1, is undefined).

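Hence, we assume that we have n independent observations of a random variable Y
with binary outcomes 0, 1. Letting $Y_i$ be the outcome of the i-th observation, $Y_i$
is a Bernoulli random variable with $P\{Y_i = 1\} = \pi(x_i) = \pi_i$, where $x_i$ is the covariate
vector for the i-th observation. Then it follows from (2.143) that the likelihood function
for $Y_i$ is

$$
f_{Y_i}(y_i) = \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}
\tag{7.111}
$$

so the likelihood function for all n observations is

$$
L = \prod_{i=1}^{n} f_{Y_i}(y_i) = \prod_{i=1}^{n} \pi_i^{y_i} (1 - \pi_i)^{1 - y_i}.
\tag{7.112}
$$

If we assume that $\pi_i$ is given by the logistic function

$$
\pi_i = \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)}
\tag{7.113}
$$

then $\hat{\beta}$ can be found by setting

$$
\frac{\partial L}{\partial \beta_j} = 0, \quad 0 \le j \le m.
\tag{7.114}
$$

As for normal random variables this is done more conveniently by setting

$$
\frac{\partial}{\partial \beta_j} \log L = 0, \quad 0 \le j \le m.
\tag{7.115}
$$

Now

$$
\ell = \log L = \sum_{i=1}^{n} \left[ y_i \log \pi_i + (1 - y_i) \log(1 - \pi_i) \right]
\tag{7.116}
$$

and

$$
\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} \left[ y_i \frac{\partial}{\partial \beta_j} \log \pi_i
+ (1 - y_i) \frac{\partial}{\partial \beta_j} \log(1 - \pi_i) \right].
\tag{7.117}
$$

Since

$$
\frac{\partial \pi_i}{\partial \beta_0} = \pi_i (1 - \pi_i) \quad \text{and} \quad
\frac{\partial \pi_i}{\partial \beta_j} = x_{ij} \pi_i (1 - \pi_i), \quad 1 \le j \le m,
\tag{7.118}
$$

it follows using the chain rule from calculus that

$$
\frac{\partial \ell}{\partial \beta_0} = \sum_{i=1}^{n} \left[ y_i \frac{\pi_i (1 - \pi_i)}{\pi_i}
- (1 - y_i) \frac{\pi_i (1 - \pi_i)}{1 - \pi_i} \right]
\tag{7.119}
$$

and

$$
\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} \left[ y_i x_{ij} \frac{\pi_i (1 - \pi_i)}{\pi_i}
- (1 - y_i) x_{ij} \frac{\pi_i (1 - \pi_i)}{1 - \pi_i} \right], \quad 1 \le j \le m.
\tag{7.120}
$$

Thus, setting these derivatives equal to zero gives

$$
\frac{\partial \ell}{\partial \beta_0} = \sum_{i=1}^{n} \left[ y_i (1 - \pi_i) - (1 - y_i) \pi_i \right]
= \sum_{i=1}^{n} (y_i - \pi_i) = 0
\tag{7.121}
$$

and

$$
\frac{\partial \ell}{\partial \beta_j} = \sum_{i=1}^{n} \left[ y_i x_{ij} (1 - \pi_i) - (1 - y_i) x_{ij} \pi_i \right]
= \sum_{i=1}^{n} (y_i x_{ij} - x_{ij} \pi_i) = 0, \quad 1 \le j \le m.
\tag{7.122}
$$

Hence, the MLE equations for $\beta$ are given by

$$
\sum_{i=1}^{n} (\pi_i - y_i) = 0
\tag{7.123}
$$

and

$$
\sum_{i=1}^{n} x_{ij} (\pi_i - y_i) = 0, \quad 1 \le j \le m.
\tag{7.124}
$$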
i=1 
Since т; depends nonlinearly on the parameters, Equations (7.121)-(7.122) represent 
m + 1 simultaneous nonlinear equations which generally must be solved numerically in 
some fashion. Usually, this is done by a Newton iteration scheme which can be shown 
to be equivalent to an iteratively reweighted least squares method analogous to that 
discussed previously for grouped data. For large sample sizes the asymptotic theory 
is essentially equivalent to the WLS estimation given in Section 6.7 [64]. So inference, 
goodness of fit and diagnostics can again be based on those of the theory of the generalized 

regression model. We refer the reader to [27, 87] for further details. 
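A Newton iteration for the ungrouped likelihood equations can be sketched as follows. This is
a minimal illustration on synthetic data, not the algorithm of any particular package; the
function name `fit_logistic_newton`, the simulated sample and the coefficient values used to
generate it are all assumptions made for the example. Each step solves
$(X^T W X)\,\delta = X^T (y - \pi)$ with $W = \mathrm{diag}(\pi_i (1 - \pi_i))$, which is the
reweighted least squares form of the Newton update.

```python
import numpy as np


def fit_logistic_newton(X, y, tol=1e-10, max_iter=50):
    """Maximum likelihood logistic regression by Newton's method.

    X is the n x (m+1) design matrix (first column of ones), y the 0-1 responses.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        pi = 1.0 / (1.0 + np.exp(-(X @ beta)))      # current fitted probabilities
        score = X.T @ (y - pi)                       # gradient of the log-likelihood
        weights = pi * (1.0 - pi)                    # diagonal of W
        info = X.T @ (X * weights[:, None])          # X' W X
        step = np.linalg.solve(info, score)
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Synthetic ungrouped data generated from an assumed logistic model.
rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
true_beta = np.array([-0.5, 1.2])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(X @ true_beta)))).astype(float)

beta_hat = fit_logistic_newton(X, y)
pi_hat = 1.0 / (1.0 + np.exp(-(X @ beta_hat)))
residual_score = X.T @ (pi_hat - y)   # left-hand sides of the likelihood equations

print("estimates:", np.round(beta_hat, 3))
print("likelihood equations at the solution:", residual_score)
```

At convergence the printed residuals of the likelihood equations should be numerically zero,
which is a convenient check that the iteration has actually solved them.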


7.7 The Generalized Linear Model 


Although the general linear model with normal errors has been shown to provide an
adequate method for modeling a wide variety of statistical data, we have already seen that
there are many situations where the normal error model is inappropriate. For example,
we showed in Chapter 6 that the drink delivery data was fit better by a power family
model of the Box-Cox type and binary response data was generally better explained using
logistic regression. Other types of data clearly have non-normal error distributions.
These include random count data, which often follow a Poisson distribution, and survival
data, which often have an exponential or gamma distribution. In this section we briefly
discuss a class of models which include all of these as particular cases, the generalized
linear model (GLIM) first introduced by Nelder and Wedderburn in 1972 [92].

The basic idea is to obtain a linear predictor, $x^T \beta$, as a function of the mean response
as a way of combining the simplicity of the linear model with the generality of a large
family of non-normal error distributions described by the exponential family of random
variables. The key concept is that of a link function, which connects the mean
response to a given error distribution.


7.7.1 Linear Predictors and Link Functions 


Let $Y_i$, $i = 1, 2, \ldots, n$ represent the outcome of the i-th random observation. We assume
that $Y_i$, $i = 1, 2, \ldots, n$ belong to the same family of random variables which differ only in
their means

$$
E(Y_i) = \mu_i, \quad i = 1, 2, \ldots, n.
\tag{7.125}
$$

As for the GLM and logistic models, we assume that there exists a function g such that

$$
g(\mu_i) = \langle x_i, \beta \rangle = x_i^T \beta,
\tag{7.126}
$$

where $x_i$ is an m + 1 vector of predictor variables and $\beta$ is an m + 1 vector of unknown
coefficients which have to be estimated from the data. The function g is referred to as
the link function. For example, in the GLM $g(\mu) = \mu$, the identity link, while for logistic
regression

$$
g(\mu_i) = g(\pi_i) = \log\left(\frac{\pi_i}{1 - \pi_i}\right) = \operatorname{logit}(\pi_i).
\tag{7.127}
$$

If g is invertible, then (7.126) gives

$$
\mu_i = g^{-1}(x_i^T \beta).
\tag{7.128}
$$

For the GLM, $g^{-1}$ is the identity, while for logistic regression

$$
g^{-1}(\eta) = \frac{e^{\eta}}{1 + e^{\eta}}
\tag{7.129}
$$

so

$$
\mu_i = \frac{\exp(x_i^T \beta)}{1 + \exp(x_i^T \beta)}.
\tag{7.130}
$$

As we have observed in dealing with binary and binomial response data, link functions
may often be taken as inverse cdfs. For example,


(a) the probit link,

$$
g(\mu) = \Phi^{-1}(\mu),
\tag{7.131}
$$

where $\Phi$ is the cdf of a N(0, 1) random variable;


(b) the complementary log-log link,

$$
g(\mu) = \log[-\log(1 - \mu)];
\tag{7.132}
$$

and


(c) the power family,

$$
g(\mu) =
\begin{cases}
\mu^{\lambda}, & \lambda \ne 0, \\
\log \mu, & \lambda = 0.
\end{cases}
\tag{7.133}
$$


Other link functions, the so-called canonical links, are given in Table 7.23.


Table 7.23 Canonical Links and Error Functions in GLIM

Distribution of    Canonical Link               Name of       Regression Model
Error Function     Function η_i                 Link          (Inverse Link Function)
(1) Normal         η_i = μ_i                    Identity      E(Y) = x^T β
(2) Binomial       η_i = log[π_i/(1 - π_i)]     Logistic      E(Y) = exp(x^T β)/[1 + exp(x^T β)]
(3) Poisson        η_i = log(λ_i)               Logarithm     E(Y) = exp(x^T β)
(4) Exponential    η_i = 1/λ_i                  Reciprocal    E(Y) = (x^T β)^{-1}

7.7.2 The Error Function 


The question now arises as to what is the relation between the link function and the
underlying distribution of $Y_i$. In particular, what is the role of the canonical links given in
Table 7.23? This is best explained through the use of the exponential family.

Recall from Eq. (2.37) that Y is said to belong to the exponential family if

$$
f_Y(y, \theta) = \exp[a(y) b(\theta) + c(\theta) + d(y)].
\tag{7.134}
$$

As we have already seen, normal, binomial and Poisson random variables are all members
of the exponential family, and when $a(y) = y$ and $b(\theta) = \psi$, (7.134) can be written
in the canonical form

$$
f_Y(y, \psi) = \exp[y \psi + c(\psi) + d(y)].
\tag{7.135}
$$

If $b(\theta)$ is a function of $E(Y) = \mu$, then letting

$$
b(\mu) = \psi = \langle \beta, x \rangle = x^T \beta,
$$

b is the canonical link. We examine this for a number of cases given in Table 7.23.
First, if $Y \sim N(\mu, \sigma^2)$, then

$$
f(y, \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[-\frac{1}{2\sigma^2}(y - \mu)^2\right]
= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left[\frac{1}{\sigma^2}\left(y\mu - \frac{\mu^2}{2}\right) - \frac{y^2}{2\sigma^2}\right].
\tag{7.136}
$$

Some simple algebra shows that the canonical link is

$$
b(\mu) = \mu.
\tag{7.137}
$$

If Y is a Bernoulli random variable, then

$$
\begin{aligned}
f(y, p) &= p^{y} (1 - p)^{1 - y} \\
&= \exp(y \log p) \exp[(1 - y) \log(1 - p)] \\
&= \exp[y \log p - y \log(1 - p) + \log(1 - p)] \\
&= \exp\{y \log[p/(1 - p)] + \log(1 - p)\}.
\end{aligned}
\tag{7.138}
$$

Since $E(Y) = p = \mu$, (7.138) can be written in the form

$$
f(y, \mu) = \exp\{y \log[\mu/(1 - \mu)] + \log(1 - \mu)\}.
\tag{7.139}
$$

Letting

$$
b(\mu) = \log[\mu/(1 - \mu)] = \psi = \langle \beta, x \rangle
\tag{7.140}
$$

we arrive at the link function for the logistic model described in the previous two sections.
Last, we consider Y having an exponential distribution. Then the density is
given by

$$
f(y, \mu) = \mu^{-1} \exp(-y/\mu) = \exp(-y/\mu - \log \mu),
\tag{7.141}
$$

where $E(Y) = \mu$. Hence, the canonical link is

$$
b(\mu) = -1/\mu = \psi = \langle \beta, x \rangle.
\tag{7.142}
$$

So the canonical link is $-1/\mu$. Since the sign is irrelevant, we can take the canonical link
as

$$
b(\mu) = 1/\mu
\tag{7.143}
$$

as given in Table 7.23. The remaining entries in the table can be obtained in the same
way.
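
As an illustration of how a remaining entry of Table 7.23 might be obtained, the following
sketch works through the Poisson case. It assumes the usual Poisson probability function
with mean $\mu$ (written $\lambda$ in the table) and follows the same steps as the Bernoulli and
exponential cases above.

```latex
\begin{align*}
f(y, \mu) &= \frac{\mu^{y} e^{-\mu}}{y!}
           = \exp\left[\, y \log \mu - \mu - \log (y!) \,\right], \\
b(\mu)    &= \log \mu = \psi = \langle \beta, x \rangle,
\qquad\text{so that}\qquad E(Y) = \mu = \exp\left(x^{T}\beta\right).
\end{align*}
```

Here the coefficient of y in the exponent is $\log \mu$, which gives the logarithm link and the
inverse link $\exp(x^T \beta)$ listed in row (3) of the table.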
7.7.3 Parameter Estimation 


As for the GLM and logistic model, the basic problem in GLIM modeling is the estimation
of the parameters in the relation

$$
\mu_i = g^{-1}(\langle x_i, \beta \rangle) = g^{-1}(x_i^T \beta),
\tag{7.144}
$$

where $x_i$, $i = 1, 2, \ldots, n$ are n observations of the m predictor variables. As for the GLM
and logistic regression this is most commonly done by maximum likelihood estimation.
Writing the density of $Y_i$ in canonical form

$$
f(y_i, \beta) = \exp[y_i (x_i^T \beta) + c(x_i^T \beta) + d(y_i)]
\tag{7.145}
$$

so that

$$
\log[f(y_i, \beta)] = y_i (x_i^T \beta) + c(x_i^T \beta) + d(y_i).
\tag{7.146}
$$

Then the log-likelihood function $\ell$ of the n observations $y_i$, $i = 1, 2, \ldots, n$ is given by

$$
\ell = \sum_{i=1}^{n} \left[ y_i (x_i^T \beta) + c(x_i^T \beta) + d(y_i) \right].
\tag{7.147}
$$

The MLE of $\beta$ is then obtained by setting

$$
\frac{\partial \ell}{\partial \beta} = 0.
\tag{7.148}
$$

This generally leads to a set of nonlinear equations for $\beta$ which usually must be solved
by iteration [27, 87]. If Newton's method is used as the numerical iterative method it can
be shown that this can be done by solving a sequence of weighted least squares equations
of the form

$$
X^T W X \beta = X^T W z,
\tag{7.149}
$$

where $X = [x_{ij}]$, $1 \le i \le n$, $1 \le j \le m$ is the design matrix and W and z generally
depend on $\beta$. Details can be found in [85, 25, 52].
Inferences concerning the model's correctness can be made via generalized likelihood
ratio tests. For example, if $\ell(\hat{\beta})$ is the maximized log-likelihood with respect to $\beta$
and $\ell(\hat{\mu})$ is the log-likelihood for the null model, we can then obtain the goodness of fit
measure, the deviance,

$$
D(\hat{\beta}) = 2\left[\ell(\hat{\beta}) - \ell(\hat{\mu})\right],
\tag{7.150}
$$

which asymptotically has a $\chi^2$-distribution with n - m - 1 degrees of freedom under the
null hypothesis $H_0$: the model being fit is correct. Hence, large values of D are taken to
be indicative of a poor fit for the $\langle x, \beta \rangle$ model. If $D(\hat{\beta}) > \chi^2(n - m - 1, \alpha)$ we conclude
there is a lack of fit at the $\alpha$ level of significance and reject $H_0$. We note that differences
in deviances may be used in a way analogous to the extra sums of squares in the GLM.
For more details we refer readers to [85, 25, 52].


7.8 Exercises 


7.1 A set of data [Journal of Pharmaceutical Sciences (1991) 80, pp. 971-977] was 
obtained on the observed mole fraction solubility of a solute at a constant tem- 
perature. The response Y is the negative logarithm of the mole fraction solubility, 
along with 


x1 = dispersion partial solubility,
x2 = dipolar partial solubility,
x3 = hydrogen bonding Hansen partial solubility.


Answer the following questions using the data in Table 7.24. 


Table 7.24 Mole Fraction Solubility Data 


1 
2 
3 
4 
5 
6 
7 
8 
9 
10 
11 
12 
13 
(a) Fit a complete second-degree polynomial model to the data.


(b) Test for significance of the regression, and construct t statistics for each model
parameter. Take α = 0.1. Interpret these results.


(c) Plot the residuals and comment on model adequacy.
(d) Use the extra-sum-of-squares method to test the contribution of all second-order
terms to the model. Take α = 0.05.


7.2 A scientist collected experimental data [90] on the radius of a propellant grain (y)
as a function of powder temperature, x1, extrusion rate, x2, and die temperature,
x3.


Grain         Powder               Extrusion    Die
Radius (y)    Temperature (x1)     Rate (x2)    Temperature (x3)
82            150                  12           220
93            190                  12           220
114           150                  24           220
124           150                  12           250
111           190                  24           220
129           190                  12           250
157           150                  24           250
164           190                  24           250


(a) Consider the multiple linear regression model with centered regressors

$$
y_i = \beta_0 + \beta_1 (x_{1i} - \bar{x}_1) + \beta_2 (x_{2i} - \bar{x}_2) + \beta_3 (x_{3i} - \bar{x}_3) + \varepsilon_i.
$$

Write the observation vector y, the design matrix X, and the parameter vector β
in the model

$$
y = X\beta + \varepsilon.
$$


(b) Write out the normal equations for least-squares estimation

$$
(X^T X)\beta = X^T y.
$$


(c) What characteristic in this experiment do you suppose produces the special
form of $X^T X$ in (b)?

(d) Estimate the regression coefficients in the model in (a).

(e) Test the hypothesis Ho : 8, = 0 and 8, = 0.

(f) Find the hat matrix H.

(h) Find the VIFs (variance inflation factors) of the coefficients $\hat{\beta}_1$, $\hat{\beta}_2$, and $\hat{\beta}_3$. Do
you have any explanation as to why these measures of damage due to collinearity
give the results that they do?


7.3 Bars of soap are scored for their appearance in a manufacturing operation. These
scores are on a 1-10 scale, and the higher the score the better. The difference between
operator performance and the speed of the manufacturing line are believed to
measurably affect the quality of the appearance. The following data were collected
on this problem:


Operator Line Appearance 
No. Speed (Sum for 30 Bars) 
1 150 255 
1 175 246 
1 200 249 
2 150 260 
2 175 228 
2 200 231 
3 150 265 
3 175 247 
3 200 206 


(a) Using dummy variables, specify the design and parameter matrices. 
(b) Fit a multiple regression model to these data. 


(c) Construct the ANOVA table. Using α = 0.05, determine whether operator
differences are important in bar appearance.


(d) Does line speed affect appearance? Take α = 0.05.


(e) Using the regression model, show that the average appearance score for operator 
#1 is 250, operator #2 is 238, and operator #3 is 256. 


(f) Plot the residuals. What model would you use to predict bar appearance? 


7.4 A set of 5 pairs of observations was obtained as below. Using the method of
orthogonal polynomials described in Section 7.2 answer the questions below.


y (index) | 98 11.0 13.2 15.1 16.0
x (year) | 1980 1981 1982 1983 1984


(a) Fit a third-degree equation to the above data.
(b) Test the hypothesis that a second-degree equation is adequate. Take α = 0.05.
7.5 An experimenter suggests the following dummy variable schemes to separate how 
possible levels depend upon groups. Are they permissible? 
Zo 4, 42 23 


(a) (b) 1 1 0 о 
ee - 1 0 1 0 
0 0 3 


1 0 0 1 


Z Za Zs Za Z Uu. Eu 7з. 4 
1 -1 -1 -1 
1 -1 -1 1 
-1 -1 1 1 
-1 1 -1 -1 


Re ep 
| 
ке 
| 
. 
[uw] 
| 
ҥе 
7.6 Consider the data for Exercise 6.3.
(a) Fit a model of the form

$$
y_i = \beta_0 + \beta_1 x_i + \beta_{11} x_i^2 + \varepsilon_i.
$$


(b) Compute the ordinary residuals, the PRESS residuals, and the sum of the
absolute PRESS residuals.


7.7 An experiment was conducted in the civil engineering department at Virginia
Polytechnic Institute and State University in 1988 to investigate the growth of a certain
type of algae in water. A set of observations was obtained as a function of time,
denoted by x2 (days), the dosage (in mg) of copper added to the water, denoted by
x1, and y denotes the units of algae observed. The data are shown in Table 7.25.


Table 7.25 Algae Growth and Copper Data

y     x1   x2     y     x1   x2     y     x1   x2     y     x1   x2
.30   1    5      .37   1    12     .23   1    18     .36   1    25
.84   1    5      .36   1    12     323   1    18     .36   1    25
.20   2    5      .30   2    12     .28   2    18     .24   2    25
.24   2    5      ol    2    12     .27   2    18     .27   2    25
.24   2    5      .30   2    12     .25   2    18     .31   2    25
.28   3    5      .30   3    12     .27   3    18     .26   3    25
.20   3    5      .30   3    12     .25   3    18     .26   3    25
.24   3    5      .30   3    12     .25   3    18     .28   3    25
.02   4    5      .14   4    12     .06   4    18     .14   4    25
.02   4    5      .14   4    12     .10   4    18     .11   4    25
.06   4    5      .14   4    12     .10   4    18     .11   4    25
0     5    5      .14   5    12     .02   5    18     .04   5    25
0     5    5      .15   5    12     .02   5    18     .07   5    25
0     5    5      .15   5    12     .02   5    18     .05   5    25


(a) Fit the data to the model

$$
y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \beta_{12} x_{1i} x_{2i} + \varepsilon_i.
$$


(b) Test the interaction term: $H_0: \beta_{12} = 0$ versus $H_1: \beta_{12} \ne 0$.

(c) Make a test for the lack-of-fit and draw a conclusion.

(d) Draw the scatter plots of the residuals for the fitted model against x1 and x2
separately. Give a comment on each of them.


7.8 Suppose that you have two sets of data with pairs of values (x, y). You consider
the model

$$
Y = \beta_0 + \beta_1 x + \beta_{11} x^2 + Z(\gamma_0 + \gamma_1 x + \gamma_{11} x^2) + \varepsilon,
$$

using a dummy variable Z whose value is -1 for set A and 1 for set B because you
are not sure whether to fit the data separately or together.


(a) Set up a hypothesis for the case of fitting a single quadratic model for all the 
data. 


(b) Set up a hypothesis for the case of fitting a single linear model for all the data. 


(c) How would you obtain separate quadratic fits to the two data sets?


7.9 Consider the data below [90]. Varying numbers of fabric type specimens were
exposed to load ($x$) in lb/in$^2$. Also listed is the number of specimens that failed.

Load (x)   No. of Specimens   No. of Failures
    5            600                13
   35            500                95
   70            600               189
   80            300                95
   90            300               130

(a) Fit the data using a logistic model.
(b) Find the maximum likelihood estimates of $\beta_0$, $\beta_1$.


(c) Using the fitted values compare the two models: the linear model and the 
logistic model. 


7.10 In an experiment testing the effect of a toxic substance, 1,500 experimental insects
were divided randomly into six groups of 250 each. After the insects in each group
were exposed to a fixed dose of the toxic substance, the number of dead insects was
counted. The results are shown below, where $\log(x_j)$ denotes the dose level on a
logarithmic scale in group $j$ and $y_j$ represents the number of dead insects in the
$j$-th group. Assume that the logistic response function is appropriate [93].

Group j                          1     2     3     4     5     6
Dose level log(x_j)              1     2     3     4     5     6
Number of insects              250   250   250   250   250   250
Number of dead insects y_j      28    53    93   126   172   197

(a) Find the maximum likelihood estimates of $\beta_0$, $\beta_1$. Write down the fitted logistic
function.

(b) Plot the fitted response function and the estimated proportions $p_j$ on the same
graph. Give a comment.

(c) What is the estimated probability that an insect dies when the dose level (on the
logarithmic scale) is 3.5?


(d) Find the estimated median lethal dose - that is, the dose for which 50 percent 
of the experimental insects are expected to die. 


7.11 Indicate how you would use the method of least squares to fit a model of the form 


у = Yr" 


where both у and у, are unknown. 


7.12 The population (in millions) of the United States from 1790 to 1970 is given in
Table 7.26:




Table 7.26 Population of the U.S. 1790-1970 


Year Population Year Population 
1790 3.929 1890 62.9 
1800 5.308 1900 76.0 
1810 7.240 1910 92.0 
1820 9.640 1920 105.7 
1830 12.870 1930 122.8 
1840 17.070 1940 131.7 
1850 23.190 1950 151.3 
1860 31.440 1960 179.3 
1870 39.820 1970 203.2 
1880 50.160 


(a) Fit a model of the form $y = \beta_0 \exp(\beta_1 x)$ to these data using the method of
least squares.


(b) Use your model found in (a) to predict the population in 1980 and compare it 
with the true value. 


(c) Use only the data from 1790 to 1900 to fit the model in (a). 


(d) If you were a demographer in 1900 and were asked to predict the population 
for the next century on the basis of the model in (c), do you think you would have 
done a good job? (Since an exponential model implies a constant growth rate, such 
models tend to be inappropriate over long periods of time. A more realistic model 
is the logistic regression model.) 


(e) Apply the logistic regression model to these data. 


7.13 Consider a regression model

$$Y = X\beta + \varepsilon$$

where $\varepsilon$ is $N(0, \sigma^2 I)$, $\sigma^2$ is known and $X = [X_1 : X_2]$, where $X_1$ is $n \times r$ and $X_2$
is $n \times (p - r)$. Suppose one wishes to test

$$H_0 : \beta_1 = 0 \quad \text{versus} \quad H_1 : \beta_1 \neq 0.$$

(a) Consider a likelihood ratio test for these hypotheses. Write out the statistic

$$\lambda(\beta_1 \mid \beta_2) = -2 \log\left[ L(\hat\beta_2) / L(\hat\beta) \right]$$

where $L(\hat\beta_2)$ is the maximized likelihood under the restricted model and $L(\hat\beta)$
is the maximized likelihood under the full model.


(b) How is this statistic related to the difference in the SSE (residual sum of 
squares) for the full and reduced models? 


7.14 Suppose that the log-link is used, producing the model

$$\mu_i = E(Y_i) = \exp\left(x_i^T \beta\right), \quad i = 1, 2, \ldots, n.$$

Show that the maximum likelihood procedure for estimation of $\beta$ results in the
solution to

$$X^T \hat e = 0$$

where $\hat e$ is the residual vector

$$\hat e = \begin{pmatrix} y_1 - \hat\mu_1 \\ y_2 - \hat\mu_2 \\ \vdots \\ y_n - \hat\mu_n \end{pmatrix}.$$

[Hint: The estimate is given as the solution for $\beta$ of

$$X^T (y - \hat\mu) = 0 \quad \text{where} \quad \hat\mu_i = \exp\left(x_i^T \hat\beta\right).$$

In other words, you are required to find the solution for $\beta$ of $\sum_{i=1}^{n} (y_i - \hat\mu_i)\, x_i = 0$.
So you need merely to maximize the log likelihood for this case and show that this
leads you to the above.]


7.15 Refer to Table 7.23, for canonical links and error functions in GLIM. 
(a) Write the gamma density in the form of exponential family, 


f (yi: 9s) = exp [a (yi) b (8:1) + g (6:) + R (0:)]. 


(b) Show how the canonical link function (and thus the regression model) is devel- 
oped for the case of the gamma distribution. 


[Hint: Using the result in (a), 9(0;) and 1. Then write 0; = x1, and thus 
xi 8 = (uu). 


7.16 Suppose that a nonlinear regression model $y = f(\theta, x) + \varepsilon$ is considered with

$$f(\theta, x) = \exp(\theta x).$$

The following observations are given:

Obs. No.    x      y
   1        1     2.713
   2        2     3.025
   3        3    11.731

(a) Write down the normal equations of the least squares method to estimate the
parameter $\theta$.

(b) Using an iterative method, determine the value of the estimator $\hat\theta_{LS}$ in (a).
(c) Find the MLE of $\theta$ with $\varepsilon \sim N(0, \sigma^2)$. Does $\hat\theta_{ML}$ coincide with $\hat\theta_{LS}$ in (a)?


Chapter 8 


Selection of a Regression Model


8.1 Introduction 


In the preceding chapters, we have discussed how to test the significance of the
postulated model through (i) the lack-of-fit test, (ii) examination of residuals and
(iii) checking of normality assumptions. These exploratory techniques often lead us to
alterations in the initial model such as transformations of the data or further regression
techniques. However, in many practical situations, we will be looking for the most
appropriate subset of regressors or predictor variables that should be used in the model.
This is called the variable selection problem or, equivalently, selecting the best subset.
Some aspects of this issue have already been examined in Chapters 5 and 6 using the
above-mentioned exploratory methods. Hence, the main purpose of this chapter is to
provide methods and criteria for doing this [57, 90, 113, 114].

Using a model different from the true one is called model misspecification. As we shall 
see, there are two major consequences of model misspecification: 


1. If some variables are omitted from the model, then the estimates of the coefficients
of the remaining variables are biased.


2. If there are too many variables, then in general the variances of the coefficient
estimates of the remaining variables become large.


In some sense, model selection is a trade-off between bias and precision. Therefore,
the main idea for selecting an appropriate/best regression model is to find a compromise
between two conflicting criteria: (a) reliability in prediction and (b) parsimony
(simplicity) of model specification from both the economical and practical points of view.
Since the term “best” is somewhat subjective, the ideal model should include the fewest
regressors that permit an adequate prediction (or interpretation) of the responses.
Usually, a unique best regression model does not exist, nor is there a unique
statistical procedure for choosing the subset; some professional or personal judgement
is needed for all the methods described in this chapter. For instance, if two regressor
variables are highly correlated with Y (the response variable) and highly correlated with each
other, then it is often sufficient to include just one of the regressors in the model. The
choice of which regressor variable to include may depend, for example, on which variable
is easier to measure or cheaper to obtain.

Thus, in this chapter we shall examine in detail both criteria functions and compu- 
tational techniques for finding the best subset in a regression model. 


8.2 Consequences of Model Misspecification


Before examining a number of criteria for model selection, we discuss a number of con- 
sequences of omitting variables from the true, but generally unknown full model. These 
will further motivate various methods for model selection. We first consider the effects 
on the estimates of the regression coefficients.

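Again we suppose the full model can be written as in (5.16)

$$Y = X\boldsymbol\beta + \varepsilon. \tag{8.1}$$

Assume now that $\boldsymbol\beta$ is partitioned as

$$\boldsymbol\beta = (\boldsymbol\beta_1, \boldsymbol\beta_2)^T \tag{8.2}$$

where $\boldsymbol\beta_1 = (\beta_0, \beta_1, \ldots, \beta_p)^T$ and $\boldsymbol\beta_2 = (\beta_{p+1}, \ldots, \beta_m)^T$. Here $\boldsymbol\beta_1$ is the vector of
coefficients in the reduced model and $\boldsymbol\beta_2$ is the vector of coefficients of the variables
that are deleted (note that this may require a permutation of the regressors). Also note that
$\dim(\boldsymbol\beta_1) = p + 1$ and $\dim(\boldsymbol\beta_2) = m + 1 - (p + 1) = m - p$. Then $X$ can be partitioned
conformally as

$$X = [X_1 \mid X_2] \tag{8.3}$$

so that (8.1) can be written as

$$Y = X_1 \boldsymbol\beta_1 + X_2 \boldsymbol\beta_2 + \varepsilon \tag{8.4}$$

and the reduced model is given by

$$Y = X_1 \boldsymbol\beta_1 + \varepsilon. \tag{8.5}$$

From (5.22b) the least squares estimate of $\boldsymbol\beta$ is given by

$$\hat{\boldsymbol\beta} = (X^T X)^{-1} X^T Y \tag{8.6}$$

and similarly the least squares estimate of $\boldsymbol\beta_1$ in the reduced model (8.5), written
$\tilde{\boldsymbol\beta}_1$ here to distinguish it from the full-model estimate, is given by

$$\tilde{\boldsymbol\beta}_1 = (X_1^T X_1)^{-1} X_1^T Y. \tag{8.7}$$

Recall from Chapter 5 that $\hat{\boldsymbol\beta}$ is an unbiased estimate of $\boldsymbol\beta$. However, $\tilde{\boldsymbol\beta}_1$ is not an
unbiased estimate of $\boldsymbol\beta_1$. To see this note that

$$E(\tilde{\boldsymbol\beta}_1) = (X_1^T X_1)^{-1} X_1^T E(Y). \tag{8.8}$$

From (8.4), $E(Y) = X_1 \boldsymbol\beta_1 + X_2 \boldsymbol\beta_2$, and using this in (8.8) gives

$$\begin{aligned}
E(\tilde{\boldsymbol\beta}_1) &= (X_1^T X_1)^{-1} X_1^T (X_1 \boldsymbol\beta_1 + X_2 \boldsymbol\beta_2) \\
&= (X_1^T X_1)^{-1} X_1^T X_1 \boldsymbol\beta_1 + (X_1^T X_1)^{-1} X_1^T X_2 \boldsymbol\beta_2 \\
&= \boldsymbol\beta_1 + A \boldsymbol\beta_2
\end{aligned} \tag{8.9}$$

where $A = (X_1^T X_1)^{-1} X_1^T X_2$ is called the alias or bias matrix. In general, $A \neq 0$ so
that $E(\tilde{\boldsymbol\beta}_1) \neq \boldsymbol\beta_1$. Hence, generally $\tilde{\boldsymbol\beta}_1$ is biased. However, in the special case where
the columns of $X_1$ are orthogonal to those of $X_2$, then $X_1^T X_2 = 0$, and $\tilde{\boldsymbol\beta}_1$ is unbiased.

The effect on the variance of $\tilde{\boldsymbol\beta}_1$ is somewhat more complicated and we refer the
reader to the paper of Hocking [57] for full details. Recall again from Theorem 5.4 that

$$\Sigma(\hat{\boldsymbol\beta}) = \sigma^2 (X^T X)^{-1} \tag{8.10}$$

and

$$\Sigma(\tilde{\boldsymbol\beta}_1) = \sigma^2 (X_1^T X_1)^{-1}. \tag{8.11}$$

It can be shown that if $\hat{\boldsymbol\beta}_1$ is the least squares estimator of $\boldsymbol\beta_1$ in $\hat{\boldsymbol\beta} = (\hat{\boldsymbol\beta}_1 \mid \hat{\boldsymbol\beta}_2)^T$ from the
full model, then

$$\Sigma(\hat{\boldsymbol\beta}_1) - \Sigma(\tilde{\boldsymbol\beta}_1) \tag{8.12}$$

is positive semi-definite, so that the variances of the components of $\hat{\boldsymbol\beta}_1$ are at least as
large as those of $\tilde{\boldsymbol\beta}_1$. In general, the variances of $\hat{\boldsymbol\beta}_1$ will be larger than those of $\tilde{\boldsymbol\beta}_1$.
Hence, deleting variables increases the precision of the estimates of $(\beta_0, \beta_1, \ldots, \beta_p)$.

It is also useful to examine the effect of deleting variables on the predicted values.
Letting $\hat Y$ denote those from the full model and $\hat Y_1$ those from the reduced model, it
follows from (5.114a) that

$$\hat Y = H Y \tag{8.13}$$

and

$$\hat Y_1 = H_1 Y, \tag{8.14}$$

where $H = X (X^T X)^{-1} X^T$ and $H_1 = X_1 (X_1^T X_1)^{-1} X_1^T$. Hence,

$$\begin{aligned}
E(\hat Y_1) &= H_1 E(Y) \\
&= H_1 (X_1 \boldsymbol\beta_1 + X_2 \boldsymbol\beta_2) \\
&= H_1 X_1 \boldsymbol\beta_1 + H_1 X_2 \boldsymbol\beta_2 \\
&= X_1 \boldsymbol\beta_1 + X_1 (X_1^T X_1)^{-1} X_1^T X_2 \boldsymbol\beta_2 \\
&= X_1 \boldsymbol\beta_1 + X_1 A \boldsymbol\beta_2 .
\end{aligned}$$

So $\hat Y_1$ is generally a biased estimate of $E(Y) = X_1 \boldsymbol\beta_1 + X_2 \boldsymbol\beta_2$. Note that even when
$X_1^T X_2 = 0$, so that $A = 0$, $E(\hat Y_1) = X_1 \boldsymbol\beta_1$ still omits the contribution $X_2 \boldsymbol\beta_2$.
A similar result exists for the variances of $\hat Y_1$ as for $\tilde{\boldsymbol\beta}_1$.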
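
The bias result (8.9) can be checked numerically. The following is a minimal sketch and not part of the original text: the data are synthetic, the coefficient values are arbitrary, and the names `X1`, `X2`, `beta1` and `beta2` are illustrative. It applies the reduced-model least squares formula to E(Y), which gives the expectation of the reduced-model estimate exactly, and compares the result with the true coefficients plus the alias term.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 50

# Retained regressors X1 (intercept + one variable) and one deleted regressor X2.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.6 * rng.normal(size=n)   # deliberately correlated with x1
X1 = np.column_stack([np.ones(n), x1])
X2 = x2.reshape(-1, 1)

beta1 = np.array([1.0, 2.0])   # hypothetical true coefficients of X1
beta2 = np.array([3.0])        # hypothetical true coefficient of X2

# Alias (bias) matrix A = (X1'X1)^{-1} X1'X2
A = np.linalg.solve(X1.T @ X1, X1.T @ X2)

# Applying the reduced-model estimator to E(Y) yields its expectation exactly.
EY = X1 @ beta1 + X2 @ beta2
expected_reduced = np.linalg.solve(X1.T @ X1, X1.T @ EY)

print("A beta2                    :", A @ beta2)
print("E(reduced estimate) - beta1:", expected_reduced - beta1)
```

The two printed vectors agree, and they shrink toward zero only when the deleted regressor is made orthogonal to the columns of `X1`.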


8.3 Criteria Functions 


Before describing various algorithms for choosing the best subset (or best subsets) we 
examine a number of criteria for evaluating competing models. 

Here we introduce some of the useful ones which are frequently used in selection
procedures. Basically there are two different types of criteria functions depending on
their usage. One is called a selection function, which is a statistic that can be used
to choose a specific model. The second type is called an assessment function, which is
used for assessing the performance of the model in terms of its intended use.




8.3.1 Coefficient of Multiple Determination $R^2$


Assume that the full model contains T variables (T for total) including the intercept and
we have a subset of p variables including the intercept. Then, as in Chapter 5 we can
define $R_p^2$, the $R^2$ value for the p-variable model, by

$$R_p^2 = \frac{SSR_p}{SST} = 1 - \frac{SSE_p}{SST} \tag{8.15}$$

where $SSR_p$ and $SSE_p$ denote the regression sum of squares and the residual sum of
squares, respectively. It should be noted that since the corrected total sum of squares,
SST, is constant for all regression models, $R_p^2$ cannot decrease as additional independent
variables are introduced into the regression model. Thus the maximum value of $R_p^2$ will
be attained when all possible variables are included in the model. With this in mind, our
goal in utilizing $R_p^2$ is to compare alternative models so that we may determine when the
introduction of additional regressor variables does not contribute a substantial increase
in $R_p^2$. To use $R_p^2$ as an assessment criterion we note the following properties of $R_p^2$.


Properties of $R_p^2$
(1) Increasing the number of regressors never decreases the coefficient of determination $R_p^2$.


(2) For a given residual mean square, $s_p^2 = SSE_p/(n - p)$, denoted by $RMS_p$, the
magnitude of $R_p^2$ depends on the magnitude of the regression coefficients. That is,
$R_p^2$ is not scale invariant.


As we explained above, since we use the same set of data for different regression
models, SST is the same for each regression. This means that the $R_p^2$ statistic is not an
absolute measure but a relative measure of goodness-of-fit. Furthermore, introducing an
additional regressor never decreases $R_p^2$, so it is not just a matter of finding the subset with
maximum $R_p^2$ but rather that of finding a suitable subset with a high $R_p^2$. To overcome
some of these difficulties a modification of $R_p^2$, the adjusted coefficient of determination
$\bar R_p^2$, defined in Chapter 5 by

$$\bar R_p^2 = 1 - \frac{n - 1}{n - p}\left(1 - R_p^2\right) = 1 - \frac{(n - 1)\, RMS_p}{SST} \tag{8.16}$$

is often used instead of $R_p^2$.

From (8.16), we see that maximizing $\bar R_p^2$ is equivalent to minimizing the residual
mean square $RMS_p$. A criticism of both $R_p^2$ and $\bar R_p^2$ is that neither incorporates
into the decision/selection criterion any consideration of the effects of incorrect model
specification.

If a large number of subsets are examined, then a graphical technique can be used
to examine $R_p^2$. Plotting $R_p^2$ versus p generally results in a curve of the form shown in
Figure 8.1. Values of $R_p^2$ near the "knee" of the curve suggest that models with that number
of variables are good models.


Figure 8.1: Plot of $R_p^2$ versus p


Similarly, a plot of $SSE_p$ versus p generally looks like Figure 8.2. Values of $SSE_p$
near the bottom of the curve suggest good models as measured by $R_p^2$ or $SSE_p$.


Figure 8.2: Plot of $SSE_p$ versus p
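
The behaviour of these criteria is easy to see on a small example. The sketch below is illustrative only: it uses synthetic data (not data from the text), fits a nested sequence of models by least squares, and prints $SSE_p$, $R_p^2$, the adjusted value from (8.16) and $RMS_p$. The helper name `fit_sse` and the chosen coefficients are assumptions made for the illustration.

```python
import numpy as np

def fit_sse(X, y):
    """Residual sum of squares of the least squares fit of y on the columns of X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

rng = np.random.default_rng(1)
n, T = 40, 5                        # T columns in the full model, intercept included
Z = rng.normal(size=(n, T - 1))
X_full = np.column_stack([np.ones(n), Z])
# Only the first two regressors matter in this synthetic response.
y = 1.0 + 2.0 * Z[:, 0] - 1.5 * Z[:, 1] + rng.normal(size=n)

sst = float(((y - y.mean()) ** 2).sum())   # corrected total sum of squares
rows = []
for p in range(1, T + 1):                  # nested models: first p columns of X_full
    sse = fit_sse(X_full[:, :p], y)
    r2 = 1.0 - sse / sst                   # (8.15)
    rms = sse / (n - p)                    # residual mean square
    adj_r2 = 1.0 - (n - 1) * rms / sst     # (8.16)
    rows.append((p, sse, r2, adj_r2, rms))
    print(f"p={p}  SSE={sse:9.3f}  R2={r2:.4f}  adjR2={adj_r2:.4f}  RMS={rms:.4f}")
```

In the output, $R_p^2$ never decreases as p grows, while the adjusted value peaks at the model with the smallest residual mean square, which is the equivalence noted after (8.16).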


8.3.2 Mallows' $C_p$


Since $R_p^2$ essentially measures only the effect of bias, it is perhaps somewhat better to
have a criterion which takes into account both precision and bias. Since a subset model
is biased, it is appropriate to look for models which minimize $E(SSE_p)$, the expected
residual sum of squares, which we denote by $MSE_p$.

Let $\hat Y_p$ be the vector of predicted values from the p-variable model. If Y is the vector
of observations, then

$$MSE_p = E(SSE_p) = E\left[(Y - \hat Y_p)^T (Y - \hat Y_p)\right]. \tag{8.17}$$

Recalling from (5.114a) that $\hat Y_p = H_p Y$ where $H_p$ is the hat matrix for the p-variable
model, then,

$$(Y - \hat Y_p)^T (Y - \hat Y_p) = Y^T (I_n - H_p)^T (I_n - H_p) Y = Y^T (I_n - H_p) Y. \tag{8.18}$$

Hence,

$$MSE_p = E\left[Y^T (I_n - H_p) Y\right]. \tag{8.19}$$

Since this is a quadratic form in Y we use Theorem 4.14 (with $A = I_n - H_p$) to get

$$MSE_p = \langle E(Y), (I_n - H_p) E(Y) \rangle + \mathrm{tr}\left[(I_n - H_p)\, \Sigma(Y)\, (I_n - H_p)\right]. \tag{8.20}$$

Since $E(Y) = X\beta$, $\Sigma(Y) = \sigma^2 I_n$, and $(I_n - H_p)^2 = (I_n - H_p)$, (8.20) becomes

$$MSE_p = \langle X\beta, (I_n - H_p) X\beta \rangle + \sigma^2\, \mathrm{tr}(I_n - H_p). \tag{8.21}$$

Because $H_p$ has rank p, $\mathrm{tr}(I_n - H_p) = n - p$ so that

$$MSE_p = \langle X\beta, (I_n - H_p) X\beta \rangle + \sigma^2 (n - p) = SSE_B + \sigma^2 (n - p) \tag{8.22}$$

where $SSE_B$ is the bias term, which is zero if p = T. From this, a reasonable assessment
criterion would be to choose p-variable models which minimize $E(SSE_p)$.
Unfortunately, since this depends on the unknown parameter vector $\beta$, this is not
immediately usable. To overcome this difficulty Mallows (1964, 1966, 1973) considered
the closely related statistic

$$C_p = \frac{SSE_p}{s^2} + 2p - n \tag{8.23}$$

where $s^2$ is the MSE from the full model with T variables. From (8.23) we have

$$E(C_p) = E\left(\frac{SSE_p}{s^2}\right) + 2p - n. \tag{8.24}$$

From above, a model with small bias has $E(SSE_p) \approx (n - p)\sigma^2$, and assuming $s^2 \approx \sigma^2$,
then

$$E(C_p) \approx \frac{\sigma^2 (n - p)}{\sigma^2} + 2p - n = p. \tag{8.25}$$

From this discussion, assuming that models with small bias are desirable, it is
reasonable to select models with “small” $C_p$; in fact, models which minimize $C_p$ such that
$C_p \approx p$.

Again, if there is a large number of variables it is convenient to plot $C_p$ versus p
superimposed over the line $C_p = p$ as shown in Figure 8.3.

Points close to the line indicate good models as measured by $C_p$. An additional form
for $C_p$ is given by

$$C_p = \frac{SSE_p - SSE_T}{s^2} + 2p - T. \tag{8.26}$$

In fact,

$$\frac{SSE_p - SSE_T}{s^2} = \frac{SSE_p - SSE_T}{SSE_T/(n - T)} = \frac{SSE_p}{s^2} - (n - T) \tag{8.27}$$

so that the right-hand side of (8.26) becomes

$$\frac{SSE_p - SSE_T}{s^2} + 2p - T = \frac{SSE_p}{s^2} - (n - T) + 2p - T = \frac{SSE_p}{s^2} + 2p - n = C_p. \tag{8.28}$$


Figure 8.3: Plot of $C_p$ versus p


Since $(SSE_p - SSE_T)/[(T - p)\, s^2] = F_p$ is the F-statistic for testing the significance of the
$T - p$ variables omitted from the p-variable model, it follows from (8.28) that

$$C_p = (T - p)(F_p - 1) + p. \tag{8.29}$$

Again, for the full model it follows from (8.29) that $C_T = T$.

It may be, however, that the full model is poorly specified and that the resulting 
mean square error is inflated. In such cases, the value of Cp can be negative. This is not 
to say that the model with the smallest C, is poor; it merely states that the full model 
is poorly specified. 


Properties of $C_p$


(1) $C_p$ depends only on the usual regression calculations, namely $SSE_p$, $s^2$, p, and n.
This is the basis for the use of $C_p$ in fast all-possible-regressions calculations.


(2) $C_p$ measures the difference in fitting errors between the full and subset models.


(3) $C_p$ consists of two parts: a random part $F_p$ and a penalty p. This means that there
is a trade-off between decreasing $F_p$ and adding variables. (Due to this the $C_p$
criterion is sometimes referred to as a penalized method of choosing a model.)


(4) $C_p$ is closely related to the adjusted coefficient of determination $\bar R_p^2$ [71]. Since
$1 - \bar R_p^2 = \frac{n - 1}{n - p}\left(1 - R_p^2\right)$, and estimating $\sigma^2$ by $\hat\sigma^2 = SSE_T/(n - T)$, we have

$$C_p = \frac{(n - T)\, SSE_p}{SSE_T} + 2p - n \tag{8.30}$$

or

$$\frac{C_p - p}{n - p} + 1 = \frac{(n - T)\, SSE_p}{(n - p)\, SSE_T} = \frac{1 - \bar R_p^2}{1 - \bar R_T^2}$$

where $\bar R_T^2$ is the adjusted coefficient of determination for a model containing all T
parameters.




One disadvantage of $C_p$ is that it seems to be necessary to calculate $C_p$ for all,
or most, of the possible subsets to allow interpretation. For other illustrations,
comments and further examples of the use of $C_p$, the reader is referred to Gorman and
Toman [44], Mallows [82], or Daniel and Wood [22].
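
The three expressions for $C_p$ in (8.23), (8.26) and (8.29) are algebraically the same quantity, which can be confirmed numerically. The sketch below is an illustration on synthetic data, not a computation from the text; `sse_of` is a hypothetical helper, and T counts the intercept as in Section 8.3.1.

```python
import numpy as np

def sse_of(X, y):
    """Residual sum of squares from the least squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

rng = np.random.default_rng(2)
n, T = 30, 5                                   # T counts the intercept
X = np.column_stack([np.ones(n), rng.normal(size=(n, T - 1))])
y = X[:, :3] @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

sse_T = sse_of(X, y)
s2 = sse_T / (n - T)                           # full-model residual mean square

results = []
for p in range(1, T + 1):                      # nested subsets: first p columns
    sse_p = sse_of(X[:, :p], y)
    cp_a = sse_p / s2 + 2 * p - n              # form (8.23)
    cp_b = (sse_p - sse_T) / s2 + 2 * p - T    # form (8.26)
    # F-statistic for the T - p omitted columns (set to 0 when none are omitted)
    F_p = (sse_p - sse_T) / ((T - p) * s2) if p < T else 0.0
    cp_c = (T - p) * (F_p - 1.0) + p           # form (8.29)
    results.append((p, cp_a, cp_b, cp_c))
    print(f"p={p}  Cp(8.23)={cp_a:8.3f}  Cp(8.26)={cp_b:8.3f}  Cp(8.29)={cp_c:8.3f}")
```

The three columns coincide for every p, and the last row reproduces $C_T = T$.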


8.3.3 The PRESS Statistic 


So far we have discussed selection criteria based on assessing the fit. However, if
prediction is an important consideration we might consider selection criteria which assess
the predictive properties of the model. For example, as we shall see in Example 8.1,
assessment criteria such as $R_p^2$, $\bar R_p^2$ or $C_p$ can often provide a number of different
models with similar values of these criteria. For further selection one can consider how well
each of these models predicts at new points and then narrow the selection based
on this property. However, as noted previously in Chapter 6, generally we will not know
the true value of the response at the new data points, but this can be overcome to some
extent by using cross-validation. That is, we omit one (sometimes more) data point, refit,
and then use the model with the i-th observation deleted to predict the value
at this point. As in Chapter 6, let $\hat y_{(-i)}$ be this predicted value; then the i-th PRESS
residual is given by $\hat e_{(i)} = y_i - \hat y_{(-i)}$. Allen [2] and Stone [111] proposed using the
PRESS statistic

$$PRESS = \sum_{i=1}^{n} \hat e_{(i)}^2 \tag{8.31}$$

as a measure of the predictive power of the model.
As noted in Section 6.4, the i-th PRESS residual $\hat e_{(i)}$ satisfies

$$\hat e_{(i)} = \frac{\hat e_i}{1 - h_{ii}} \tag{8.32}$$

where $\hat e_i$ represents the i-th residual from the model fitted to all n observations and
$h_{ii}$ is the i-th diagonal element of the hat matrix H. Hence,

$$PRESS = \sum_{i=1}^{n} \left(\frac{\hat e_i}{1 - h_{ii}}\right)^2 \tag{8.33}$$

which can be computed without having to refit. Letting $PRESS_p$ be this value for a
p-variable model we have

$$PRESS_p = \sum_{i=1}^{n} \left(\frac{\hat e_{i,p}}{1 - h_{ii,p}}\right)^2 \tag{8.34}$$

where $\hat e_{i,p}$ is the i-th residual from the p-variable model and $h_{ii,p}$ is the i-th diagonal
element of

$$H_p = X_p (X_p^T X_p)^{-1} X_p^T. \tag{8.35}$$

Letting $D_p = \mathrm{diag}[I_n - H_p]$, $PRESS_p$ can be written as

$$PRESS_p = \hat e_p^T D_p^{-1} D_p^{-1} \hat e_p = \hat e_p^T D_p^{-2} \hat e_p \tag{8.36}$$

where $\hat e_p = Y - \hat Y_p$ is the residual vector. It is suggested that the best predictive
model is one that minimizes $PRESS_p$.


8.3.4 Standardized Residual Sum of Squares


In [102] Schmidt suggested a criterion function for assessment that is closely related to
$PRESS_p$, which is called the standardized residual sum of squares ($RSS_p^*$), defined by

$$RSS_p^* = \hat e_p^T D_p^{-1} \hat e_p \tag{8.37}$$

where $\hat e_p$ and $D_p$ are the same as in equation (8.36).

Intuitively the true model minimizes $E(RSS_p^*)$. Inspection of (8.36) and (8.37) shows
that they are both weighted sums of squares of the residuals and, hence, direct comparison
of these two functions with those previously discussed is difficult. However, it would
appear that for large samples both (8.36) and (8.37) would be close to $RSS_p$ because
one can expect that $\hat y_i$ would be approximately equal to the estimator $\hat y_{(-i)}$.
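
The claim that PRESS needs no refitting can be verified directly. The following sketch uses synthetic data and illustrative variable names (assumptions, not taken from the text): it computes PRESS from the residuals and hat-matrix diagonal as in (8.33), recomputes it by explicitly deleting each observation and refitting, and also evaluates the standardized residual sum of squares of (8.37).

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 25, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(size=n)

# Hat matrix, ordinary residuals and leverages h_ii
H = X @ np.linalg.solve(X.T @ X, X.T)
resid = y - H @ y
h = np.diag(H)

rss = float(resid @ resid)                               # ordinary residual sum of squares
rss_star = float(np.sum(resid ** 2 / (1.0 - h)))         # (8.37): weights 1/(1 - h_ii)
press_shortcut = float(np.sum((resid / (1.0 - h)) ** 2)) # (8.33): weights 1/(1 - h_ii)^2

# Explicit leave-one-out: delete observation i, refit, predict y_i
press_loo = 0.0
for i in range(n):
    keep = np.arange(n) != i
    coef, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
    press_loo += float((y[i] - X[i] @ coef) ** 2)

print("RSS            :", rss)
print("RSS*           :", rss_star)
print("PRESS (8.33)   :", press_shortcut)
print("PRESS (refits) :", press_loo)
```

Because each leverage lies between 0 and 1, the three quantities are ordered as RSS, then RSS*, then PRESS, and they move closer together as the leverages shrink with growing n.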


8.3.5 Other Criteria 


Besides the above criteria functions, many other criteria have been suggested for selection 
procedures. We describe some of them briefly. 


(1) Tukey's rule 


The rule says to choose the set of regressors which yields a minimum of $s^2/\nu$, where $s^2$
is the mean square error and $\nu = n - p$ is the degrees of freedom of the model chosen.
See Anscombe and Tukey [4] for more details.


(2) Average Estimated Variance 


The average estimated variance (AEV) criterion was suggested by Helms [53]. The key
idea is to average the prediction variance over the whole regression region of interest,
rather than just for the data points given, using a weight function which attaches more
weight to the more important points in the region. In the special case when the moment
matrix is $M = (X^T X)/n$, Helms ([53], p. 265) explored the monotonic relationship
between

$$AEV = \frac{p \cdot RSS_p}{n (n - p)} \tag{8.38}$$

and $C_p$ for subsets with p variables. Helms ([53], p. 269) also questioned the practice
of always including an intercept term in the model by saying that “our experience
has indicated intercept terms are frequently primary contributors to variance but their
absence often leads to only small contributions to bias."

In summary, although many selection criteria have been developed, there is no general
theory as to which one(s) to use in a given situation. For further information we
recommend seeing Hocking [57], Seber [104] and Thompson [113, 114]. Further approaches are
given below.


8.4 Various Methods for Model Selection 


In this section we discuss various computational techniques for model selection. In [57]
Hocking pointed out that three distinct ingredients of model selection
procedures in multiple regression analysis can be identified:

(a) the computational techniques used to identify a set of possible models to be considered;


(b) the criterion function used to select a particular model;


(c) the estimation of parameters in the chosen model.


Usually all three of these are put together but we shall discuss them separately. On 
estimation, several authors (Rencher and Pun [98], Copas [21]) have pointed out that if 
least squares estimation is used in any situation where the same data is used for selecting 
the model as for estimating the parameters, and there is competition between models, 
then the least squares estimators are biased. 

According to Miller [86] and Berk [7] this bias can be substantial. Some comments 
on how to cope with the bias are made later. For the rest of this section we consider 
computational techniques of the stepwise type, and the results of these are sets of possible 
models to which we need to apply a selection criterion. 


8.4.1 Evaluating All Possible Regressions 


In selecting the best regression model, the first approach we consider is to evaluate all
possible linear regression models. The procedure requires the fitting of every possible
regression model which includes the intercept $\beta_0$ and any number of the regressor variables
$x_i$, $i = 1, 2, \ldots, T$ (here T counts the regressors only, excluding the intercept). Therefore
there are $2^T$ potential models to be considered. For the assessment of all possible linear
regression models we shall focus upon the following criteria.


(a) the $R_p^2$ statistic from the least squares fit;
(b) $s^2$, the residual mean square, and
(c) the $C_p$ statistic.
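
The enumeration itself is straightforward to sketch. The code below is a minimal illustration on synthetic data, not the procedure used to produce the results in the text; the function name `all_possible_regressions` and the dictionary keys are assumptions. It fits every model containing the intercept and any subset of k regressors and records the three criteria listed above.

```python
from itertools import combinations
import numpy as np

def sse_of(X, y):
    """Residual sum of squares from the least squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def all_possible_regressions(Z, y):
    """Fit every model with an intercept and any subset of the columns of Z."""
    n, k = Z.shape
    ones = np.ones((n, 1))
    sst = sse_of(ones, y)                              # intercept only: SSE equals SST
    s2_full = sse_of(np.hstack([ones, Z]), y) / (n - k - 1)
    table = []
    for m in range(k + 1):                             # m = number of regressors
        for cols in combinations(range(k), m):
            X = np.hstack([ones, Z[:, np.array(cols, dtype=int)]])
            p = m + 1                                  # parameters, intercept included
            sse = sse_of(X, y)
            table.append({
                "regressors": cols, "m": m, "SSE": sse,
                "R2": 1.0 - sse / sst,
                "s2": sse / (n - p),
                "Cp": sse / s2_full + 2 * p - n,
            })
    return table

rng = np.random.default_rng(5)
Z = rng.normal(size=(20, 4))
y = 2.0 + 1.5 * Z[:, 0] - 1.0 * Z[:, 2] + rng.normal(size=20)

table = all_possible_regressions(Z, y)
for row in sorted(table, key=lambda r: r["Cp"])[:5]:   # five smallest Cp values
    print(row["regressors"], round(row["R2"], 4), round(row["Cp"], 2))
```

With k = 4 regressors the table has $2^4 = 16$ rows, and the number of fits doubles with each additional regressor, which is why this approach quickly becomes expensive.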


Example 8.1 (Hald Cement data) Let us reconsider the Hald cement data (n = 13)
that was used in Example 5.16. We will use these data to illustrate the “all possible
regressions" approach to variable selection. Since the number of variables under
consideration is four, there will be a total of $2^4 = 16$ possible regression models that include the
term $\beta_0$. The value of m indicates the number of regressors in the model (p = m + 1).
The results of fitting these 16 models are shown in Table 8.1.

We first evaluate the subset models using the $R_p^2$ criterion from Table 8.1 and Figure
8.5. We observe that there is great similarity among the $R_p^2$ values for the various
regression models. For the two-parameter model (p = 2; $\beta_0$ and one regressor) the best
fit clearly occurs when $x_4$ is included ($R_p^2 = 0.6746$) and the next best when $x_2$ is included
($R_p^2 = 0.6663$). For three-parameter models, $(x_1, x_2)$ and $(x_1, x_4)$ show relatively higher $R_p^2$
values. In addition, for the four-parameter models, $(x_1, x_2, x_3)$ and $(x_1, x_2, x_4)$ show higher
ones. Finally, with all five parameters considered, $R_p^2 = 0.9824$. From Figure 8.5 we
can see that $R_p^2$ rises steadily until three parameters are included in the model and then
does not exhibit any substantial increase as additional parameters are added. So as we
see, the best two-regressor model $(x_1, x_2)$ has $\bar R_p^2 = 0.9744$ while the best three-regressor
model $(x_1, x_2, x_4)$ has $\bar R_p^2 = 0.9765$. Therefore, in the interest of simplicity, the researcher
would decide to select the regression model which contains only $x_1, x_2$, since inclusion of
$x_4$ leads to a negligible increase in $\bar R_p^2$.


Table 8.1 Summary of All Possible Regressions for Hald data


Model   Regressors
 No.    in model          m     SSE_p       R_p^2    adj. R_p^2    MSE_p      C_p
  1     None (beta_0)     0   2715.7635    0         0           226.3196   442.92
  2     x1                1   1265.6867    0.5340    0.4916      115.0624   202.55
  3     x2                1    906.3363    0.6663    0.6359       82.3942   142.49
  4     x3                1   1939.4005    0.2859    0.2210      176.3092   315.16
  5     x4                1    883.8669    0.6746    0.6450       80.3515   138.73
  6     x1, x2            2     57.9045    0.9787    0.9744        5.7904     2.68
  7     x1, x3            2   1227.0721    0.5482    0.4578      122.7073   198.10
  8     x1, x4            2     74.7621    0.9725    0.9670        7.4762     5.50
  9     x2, x3            2    415.4427    0.8470    0.8164       41.5443    62.44
 10     x2, x4            2    868.8801    0.6801    0.6161       86.8880   138.23
 11     x3, x4            2    175.7380    0.9353    0.9224       17.5738    22.37
 12     x1, x2, x3        3     48.1106    0.9823    0.9764        5.3456     3.04
 13     x1, x2, x4        3     47.9727    0.9823    0.9765        5.3303     3.02
 14     x1, x3, x4        3     50.8361    0.9813    0.9750        5.6485     3.50
 15     x2, x3, x4        3     73.8145    0.9728    0.9638        8.2017     7.34
 16     x1, x2, x3, x4    4     47.8636    0.9824    0.9736        5.9829     5.00
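
As a check on how the columns of Table 8.1 are related, the entries for model 6 (regressors $x_1, x_2$, so p = 3) can be recomputed from its $SSE_p$ using (8.15), (8.16) and (8.23). The arithmetic below is a worked illustration that uses only values printed in the table (n = 13, SST = 2715.7635 from model 1, and $s^2 = 5.9829$ from the full model 16); differences in the last digit are due to rounding.

```latex
\begin{align*}
R_p^2      &= 1 - \frac{57.9045}{2715.7635} \approx 1 - 0.02132 = 0.9787,\\
\bar R_p^2 &= 1 - \frac{13 - 1}{13 - 3}\,(0.02132) \approx 0.9744,\\
MSE_p      &= \frac{57.9045}{13 - 3} \approx 5.790,\\
C_p        &= \frac{57.9045}{5.9829} + 2(3) - 13 \approx 9.678 - 7 = 2.68 .
\end{align*}
```

For the full model the same formula gives $C_T = 47.8636/5.9829 + 2(5) - 13 \approx 5.00 = T$, in agreement with (8.29).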


Also, we calculate the simple correlations among the four regressor variables and the
response.

Table 8.2 Correlation Matrix for Hald’s data


         y        x1       x2       x3       x4
y       1.0
x1      0.731    1.0
x2      0.816    0.299    1.0
x3     -0.535   -0.824   -0.139    1.0
x4     -0.821   -0.245   -0.973    0.030    1.0


However, examining all possible regressions easily becomes a laborious and
computationally burdensome task if a large number of regressors are under consideration. For
example, when the number of independent variables is T = 10, there will be 1,024
possible models to consider! Thus, other search procedures have been developed in order
to find the “best” subset of variables by adding or removing variables one at a time,
without examining all possible models. These methods are generally called stepwise
procedures, which can be categorized according to the method of adding or removing variables:
(1) backward elimination, (2) forward selection, and (3) stepwise regression, which is a
combination of procedures (1) and (2). We now describe these in detail.


8.4.2 Backward Elimination 


The backward elimination procedure starts with the full model and removes one variable
at a time without adding variables. One includes all T possible regressor variables, and
attempts to eliminate them from the model one at a time until no further removal occurs. Since
the backward elimination method seeks only to remove variables from the model, the
variable with the smallest incremental contribution to the regression is tested at each
step to determine whether it can be eliminated from the model.


Figure 8.4: Plot of $C_p$ versus p for Hald data


Figure 8.5: Plot of $R_p^2$ versus p for Hald data

This procedure is more economical because it evaluates fewer models than the previous
approach. Even though the models examined are more complex
and require more computing time, the backward elimination procedure is often regarded
as a good variable selection procedure. We now summarize the basic steps in the method
as:


(1) A regression model with all variables included is computed. 


(2) The partial F-statistic (or t statistic) is calculated for each regressor as if it just 
entered into the model. (This is to evaluate the incremental contribution of the 
regressor). 


(3) If the smallest of these partial F-statistics (or t statistics) is smaller than the critical
value $F_{out}$, which corresponds to a prescribed level of significance (i.e., F-to-remove),
remove that variable from the model and recalculate the partial F-values for the new
model with p - 1 variables.


(4) Repeat (2) and (3); the procedure terminates when the smallest partial
F-value is not smaller than the critical value $F_{out}$.
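
The four steps above can be sketched in code as follows. This is a simplified illustration and not the MINITAB implementation: the data are synthetic, the function names are hypothetical, and a fixed threshold `f_out` is used at every step in place of a critical value looked up for the current degrees of freedom.

```python
import numpy as np

def design(Z, cols):
    """Design matrix: an intercept column followed by the chosen columns of Z."""
    idx = np.array(cols, dtype=int)
    return np.hstack([np.ones((Z.shape[0], 1)), Z[:, idx]])

def sse_of(X, y):
    """Residual sum of squares from the least squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return float(resid @ resid)

def backward_elimination(Z, y, f_out=4.0):
    """Drop the regressor with the smallest partial F until none falls below f_out."""
    n, k = Z.shape
    active = list(range(k))                    # step (1): start from the full model
    while active:
        sse_cur = sse_of(design(Z, active), y)
        s2 = sse_cur / (n - len(active) - 1)   # residual mean square of current model
        # step (2): partial F of each regressor = extra SSE when it is dropped / s2
        partial_f = {}
        for j in active:
            rest = [c for c in active if c != j]
            partial_f[j] = (sse_of(design(Z, rest), y) - sse_cur) / s2
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] >= f_out:        # step (4): nothing left to remove
            break
        active.remove(weakest)                 # step (3): remove and recompute
    return active

rng = np.random.default_rng(4)
Z = rng.normal(size=(60, 4))
y = 2.0 + 3.0 * Z[:, 0] - 2.0 * Z[:, 1] + rng.normal(size=60)

selected = backward_elimination(Z, y, f_out=4.0)
print("retained regressors (column indices):", selected)
```

In this synthetic example only columns 0 and 1 carry signal, so they are always retained; whether a pure-noise column survives depends on the threshold chosen.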


Example 8.2 (Hald data) We use the Hald cement data to illustrate the backward
elimination procedure. Before running the computer package (MINITAB and many other
statistical packages have these functions), we take $\alpha = 0.05$ for the cutoff point, so
$F_{out} = F_{1,n-T,0.05}$ (or $t_{out} = t_{n-T,0.05}$). Then any regressor will be dropped from the
model if its partial F-statistic is less than $F_{out}$. The backward elimination process starts
with the full model and calculates the partial F-values (or equivalently t values) for all
individual regressors. We now write out the detailed description of the procedure.

First, the regression model containing all four variables is fit. From the results, this
is

$$\hat y = 62.4 + 1.55 x_1 + 0.510 x_2 + 0.102 x_3 - 0.144 x_4$$

with the overall F-value 111.48. Denoting the partial F-value $F(x_j \mid x_i, \ldots)$ by $F_{j|i,\ldots}$, the four
F-values are

$$F_{1|2,3,4} = 4.326, \quad F_{2|1,3,4} = 0.49, \quad F_{3|1,2,4} = 0.0196, \quad F_{4|1,2,3} = 0.04.$$

Recall that the extra sum of squares to obtain all these F-values is given by (5.247).
For instance, s = 2.446 (so $s^2 = 5.983$) and $F_{1|2,3,4} = [(73.815 - 47.863)/1]/5.983 = 4.34$,
which agrees up to rounding with the value 4.326 obtained by squaring the rounded
t-value 2.08. We also note that one can get these values of F by squaring the t-values in the
regression of Y onto $x_1, x_2, x_3$ and $x_4$, that is, $F_{1,\nu} = t_\nu^2$ ($\nu$ is the degrees of freedom). Since
$F_{out} = F_{1,8,0.05} = 5.32$ from Table A.4b, we drop the variable $x_3$, which has the smallest
partial F, from the model.

Next, we consider the reduced model with three variables, $x_1$, $x_2$ and $x_4$. The result
shows that the fit is

$$\hat y = 71.65 + 1.452 x_1 + 0.416 x_2 - 0.237 x_4$$

and s = 2.309 with the overall F-value 166.83, which is significant at the 0.1% level
($F_{3,9,0.001} = 13.90$). Since the three partial F-values are

$$F_{1|2,4} = 154.01, \quad F_{2|1,4} = 5.02, \quad F_{4|1,2} = 1.88,$$

by comparing to the critical value $F_{1,9,0.05} = 5.12$, we eliminate the variable $x_4$, which
has the smallest partial F.

We now work with the model $Y = f(x_1, x_2)$. One can find

$$\hat y = 52.58 + 1.468 x_1 + 0.662 x_2,$$

which provides s = 2.406, and the overall F-statistic is $229.50 > F_{2,10,0.001} = 14.91$.
We find that both partial F-values for $x_1$, $x_2$ exceed the critical value $F_{1,10,0.05} = 4.96$.
Therefore, the backward elimination procedure terminates, yielding the final
model

$$\hat y = 52.58 + 1.468 x_1 + 0.662 x_2.$$

The following Table 8.2 summarizes the steps in the backward elimination procedure
based on the output from MINITAB.


Table 8.2 Backward Elimination Method for the Hald Data 

Step        Constant   x1      x2      x3     x4       R^2 (C_p)   F_out & decision 
1           62.41      1.55    0.510   0.10   -0.14    0.982       F_{1,8,0.05} = 5.32 
(t-value)              2.08    0.70    0.14   -0.20    (5.0)       => x3 removed 
2           71.65      1.45    0.416   -      -0.237   0.982       F_{1,9,0.05} = 5.12 
(t-value)              12.41   2.24    -      -1.37                => x4 removed 
3           52.58      1.47    0.662   -      -        0.979       F_{1,10,0.05} = 4.96 
(t-value)              12.10   14.44   -      -                    => Stopped 
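
The elimination steps in Example 8.2 can be retraced with a short program. The following is a minimal sketch, not the MINITAB implementation: it assumes the standard published Hald values (taken here to match the data set used in the text), fits each model by solving the centered normal equations in plain Python, and looks up $F_{\text{out}}$ in a small table. The helper names `rss`, `partial_f` and `backward_elimination` are illustrative only.

```python
# Hald cement data (13 observations). These are the standard published values,
# assumed here to match the data set used in the text.
DATA = {
    1: [7, 1, 11, 11, 7, 11, 3, 1, 2, 21, 1, 11, 10],
    2: [26, 29, 56, 31, 52, 55, 71, 31, 54, 47, 40, 66, 68],
    3: [6, 15, 8, 8, 6, 9, 17, 22, 18, 4, 23, 9, 8],
    4: [60, 52, 20, 47, 33, 22, 6, 44, 22, 26, 34, 12, 12],
}
Y = [78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7, 72.5, 93.1, 115.9, 83.8, 113.3, 109.4]
N = len(Y)

# F_{1,v,0.05} by residual degrees of freedom v. The values for v = 8, 9, 10
# are quoted in the text; 4.84 for v = 11 is the usual table value.
F_OUT = {8: 5.32, 9: 5.12, 10: 4.96, 11: 4.84}


def center(v):
    mean = sum(v) / len(v)
    return [a - mean for a in v]


def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for k in range(n):
        piv = max(range(k, n), key=lambda r: abs(M[r][k]))
        M[k], M[piv] = M[piv], M[k]
        for r in range(k + 1, n):
            f = M[r][k] / M[k][k]
            for c in range(k, n + 1):
                M[r][c] -= f * M[k][c]
    z = [0.0] * n
    for k in reversed(range(n)):
        z[k] = (M[k][n] - sum(M[k][c] * z[c] for c in range(k + 1, n))) / M[k][k]
    return z


def rss(model):
    """Residual sum of squares of y on an intercept plus the regressors in `model`."""
    yc = center(Y)
    tss = sum(a * a for a in yc)
    if not model:
        return tss
    cols = [center(DATA[j]) for j in model]
    A = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    g = [sum(a * b for a, b in zip(ci, yc)) for ci in cols]
    beta = solve(A, g)
    return tss - sum(bk * gk for bk, gk in zip(beta, g))


def partial_f(j, model):
    """Extra-sum-of-squares F for regressor j given the others in `model`."""
    full = rss(model)
    reduced = rss([k for k in model if k != j])
    return (reduced - full) / (full / (N - len(model) - 1))


def backward_elimination(model):
    model = list(model)
    while model:
        f_values = {j: partial_f(j, model) for j in model}
        worst = min(f_values, key=f_values.get)
        if f_values[worst] >= F_OUT[N - len(model) - 1]:
            break                      # smallest partial F is not below F_out
        model.remove(worst)
    return model


print(backward_elimination([1, 2, 3, 4]))
```

Under these assumptions the sketch drops $x_3$ and then $x_4$, the same sequence as in Table 8.2.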


8.4.3 Forward Selection 


With this approach, we start with no variables in the model (other than the intercept), 
and add one variable at a time which gives the largest increase in the regression sum of 
squares. This procedure does not permit the removal of a variable from the model once 
it has been entered. The method is summarized as follows: 


(1) Calculate the individual sample correlations or F-statistics for testing the signifi- 
cance of adding one variable to the regression. 


(2) If any of these F-statistics exceeds the critical value Fin (the F-to-enter value 
fixed by a prescribed level of significance), enter the variable with the largest F-value 
into the model. 


(3) Repeat (2). That is, the regressor having the largest partial F-value (or equivalently 
the highest partial correlation with y given the other regressors already in the 
model) is added to the model if its F-value exceeds the Fin value. 


(4) The selection procedure will terminate if either the partial F-statistic at a particular 
step does not exceed Fin or if the last candidate variable is added to the model. 

Although this simplifies the model selection procedure, it unfortunately often 
leads to the inclusion of variables that do not make a significant contribution once other 
independent variables are entered in the regression model. 
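
The steps above can be sketched in a few lines when the residual sums of squares (RSS) of all candidate models are available, as in Table 8.5 of the exercises. The sketch below is illustrative only: the RSS values are the usual published figures for the Hald data, rounded to two decimals, and are an assumption of this example, as are the names `f_to_enter` and `forward_selection`. The critical values are $F_{1,\nu,0.10}$ indexed by the residual degrees of freedom $\nu$.

```python
# Residual sums of squares for every subset of the four Hald regressors
# (intercept always included), rounded to two decimals. These are the usual
# published values for this data set and are an assumption of this sketch.
RSS = {
    (): 2715.76,
    (1,): 1265.69, (2,): 906.34, (3,): 1939.40, (4,): 883.87,
    (1, 2): 57.90, (1, 3): 1227.07, (1, 4): 74.76,
    (2, 3): 415.44, (2, 4): 868.88, (3, 4): 175.74,
    (1, 2, 3): 48.11, (1, 2, 4): 47.97, (1, 3, 4): 50.84, (2, 3, 4): 73.81,
    (1, 2, 3, 4): 47.86,
}
N = 13  # number of observations

# F_{1,v,0.10} by residual degrees of freedom v. The values 3.23, 3.29 and
# 3.36 appear in Example 8.3; 3.46 is the usual table value for v = 8.
F_IN = {11: 3.23, 10: 3.29, 9: 3.36, 8: 3.46}


def rss(model):
    return RSS[tuple(sorted(model))]


def f_to_enter(j, model):
    """Partial F for adding regressor j to the regressors already in `model`."""
    new = rss(model + [j])
    v = N - len(model) - 2            # residual df of the enlarged model
    return (rss(model) - new) / (new / v), v


def forward_selection(candidates=(1, 2, 3, 4)):
    model, history = [], []
    remaining = list(candidates)
    while remaining:
        scores = {j: f_to_enter(j, model) for j in remaining}
        best = max(remaining, key=lambda j: scores[j][0])
        f_value, v = scores[best]
        if f_value <= F_IN[v]:
            break                     # no candidate exceeds F_in: stop
        model.append(best)            # once entered, a variable is never removed
        remaining.remove(best)
        history.append((best, round(f_value, 2)))
    return model, history


model, history = forward_selection()
print(model)      # order of entry
print(history)    # (variable, F-to-enter) at each step
```

With $\alpha = 0.10$ this sketch enters $x_4$, then $x_1$, then $x_2$, and stops when $x_3$ fails to exceed the critical value.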


Example 8.3 (Hald data) We will conduct the forward selection procedure using 
the Hald cement data. Let us take $\alpha = 0.10$. Then $F_{\text{in}} = F_{1,n-p,\alpha}$ is the 
critical value at each step of the procedure. 

As a first step, we choose $x_4$ because it has the largest absolute simple correlation 
($r_{y,4} = -0.821$) with the response variable $y$, and the model using $x_4$ gives the F-value 
$22.80 > F_{1,11,0.10} = 3.23$. We then proceed according to the steps 
described above until no other variable exceeds the critical value. 

We summarize the steps in the forward selection procedure in Table 8.3 based on the 
output from MINITAB. 


Table 8.3 Forward Selection Method for Hald Data 

Step        Constant   x1      x2     x3   x4       R^2 (C_p)   F_in & decision 
1           117.57     -       -      -    -0.738   0.675       F_{1,11,0.10} = 3.23 
(t-value)              -       -      -    -4.77    (138.7)     => x4 entered 
2           103.10     1.44    -      -    -0.614   0.972       F_{1,10,0.10} = 3.29 
(t-value)              10.40   -      -    -12.62   (5.5)       => x1 entered 
3           71.65      1.45    0.42   -    -0.237   0.982       F_{1,9,0.10} = 3.36 
(t-value)              12.41   2.24   -    -1.37                => x2 entered 
                                                                Stopped 


Hence, we conclude that the final model using the forward selection procedure contains $x_1$, $x_2$ 
and $x_4$, which is 

$$\hat{y} = 71.65 + 1.45x_1 + 0.42x_2 - 0.237x_4.$$


We also note that this may be different from the model we found using the backward 
elimination procedure even if we use the same $\alpha = 0.05$. That is, the same $\alpha$ does 
not guarantee the same results among different selection procedures. In fact, if we had 
used $\alpha = 0.05$, the forward selection procedure would stop after the second step, because 
$F_{2|1,4} = 5.03 < F_{1,9,0.05} = 5.12$, leading to a model that contains the 
variables $x_1$ and $x_4$ rather than $x_1$ and $x_2$. 


8.4.4 The Stepwise Regression Procedure 


Perhaps this is the most widely used procedure for model selection. This procedure 
employs a series of tests (t or F) to check for the significance of the regressors entered into, 
or removed from, the model. Since the stepwise regression procedure is a combination 
of forward selection and backward elimination, the procedure requires two cutoff values, 
$F_{\text{in}}$ for inclusion and $F_{\text{out}}$ for removal. 

In each step, all regressors entered into the model previously are reassessed via their 
partial F-statistics. A regressor added at an earlier step can now be eliminated if its 
partial F-statistic is smaller than $F_{\text{out}}$. Likewise, a regressor dropped earlier may 
be added again into the model if its recalculated partial F-statistic is larger than $F_{\text{in}}$. 
Often it is convenient to choose $F_{\text{in}} = F_{\text{out}}$. If we take $F_{\text{in}} > F_{\text{out}}$, then it becomes more 
difficult to add a variable than to remove one. Note that when only one variable is being 
considered, (t-ratio)$^2$ = F-ratio and thus the t-test and F-test are equivalent. The 
following is a description of the basic algorithm. 

(1) The procedure begins with the variable, say $x_{(1)}$, that is most highly 
correlated with the response variable $Y$. This variable is also the one that produces 
the largest F-value. If the F-statistic for this model is larger than $F_{\text{in}}$, the 
model includes the variable. Otherwise, the process terminates with no regressors 
included in the model. 


(2) Regress $Y$ on $x_{(1)}$ and compute a partial F-test for each of the $p - 1$ remaining 
variables given $x_{(1)}$. If the largest partial F-value exceeds $F_{\text{in}}$, then the second variable, 
say $x_{(2)}$, is included. Otherwise, the process is terminated, and only $x_{(1)}$ is 
included in the model. 


(3) We now determine whether any of the variables already included are no longer 
important, given that others have subsequently been added. If the partial F-value of a 
variable is at least $F_{\text{out}}$, keep it in the model. If its partial F-value is smaller than 
$F_{\text{out}}$, drop it from the model. 


(4) Repeat (2) and (3), with the current model in place of $x_{(1)}$, until no other 
variables are to be entered or removed. 
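
The algorithm above can be sketched as an entry test followed by a removal test in each pass. This is a minimal illustration rather than a reference implementation: it assumes $F_{\text{in}} = F_{\text{out}}$ at $\alpha = 0.10$, and it uses the usual published residual sums of squares for the Hald data (rounded to two decimals), which are an assumption of this example. The names `partial_f` and `stepwise` are hypothetical.

```python
# Residual sums of squares for every subset of the four Hald regressors
# (intercept always included). Usual published values, rounded to two
# decimals; an assumption of this sketch.
RSS = {
    (): 2715.76,
    (1,): 1265.69, (2,): 906.34, (3,): 1939.40, (4,): 883.87,
    (1, 2): 57.90, (1, 3): 1227.07, (1, 4): 74.76,
    (2, 3): 415.44, (2, 4): 868.88, (3, 4): 175.74,
    (1, 2, 3): 48.11, (1, 2, 4): 47.97, (1, 3, 4): 50.84, (2, 3, 4): 73.81,
    (1, 2, 3, 4): 47.86,
}
N = 13  # number of observations

# F_{1,v,0.10} by residual degrees of freedom v, used for both F_in and F_out.
# 3.46 (v = 8) is the usual table value; the others are quoted in the text.
F_CRIT = {11: 3.23, 10: 3.29, 9: 3.36, 8: 3.46}


def rss(model):
    return RSS[tuple(sorted(model))]


def partial_f(j, model):
    """Partial F for regressor j given the other regressors in `model` (j included)."""
    others = [k for k in model if k != j]
    v = N - len(model) - 1
    return (rss(others) - rss(model)) / (rss(model) / v), v


def stepwise(candidates=(1, 2, 3, 4), max_passes=20):
    model, log = [], []
    for _ in range(max_passes):
        changed = False
        # Entry step: try the best variable not yet in the model.
        outside = [j for j in candidates if j not in model]
        if outside:
            best = max(outside, key=lambda j: partial_f(j, model + [j])[0])
            f_value, v = partial_f(best, model + [best])
            if f_value > F_CRIT[v]:
                model.append(best)
                log.append(("enter", best))
                changed = True
        # Removal step: reassess every variable currently in the model.
        if model:
            worst = min(model, key=lambda j: partial_f(j, model)[0])
            f_value, v = partial_f(worst, model)
            if f_value < F_CRIT[v]:
                model.remove(worst)
                log.append(("remove", worst))
                changed = True
        if not changed:
            break
    return sorted(model), log


final_model, log = stepwise()
print(final_model)
print(log)
```

The `max_passes` cap is only a safeguard against cycling, which can occur in general when $F_{\text{in}} < F_{\text{out}}$.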


Example 8.4 (Hald data) We will illustrate the stepwise regression procedure using the 
Hald cement data. We take $\alpha = 0.10$ for both $F_{\text{in}}$ and $F_{\text{out}}$. At the beginning 
stage, consider $F_{\text{in}} = F_{1,11,0.10} = 3.23$. The first choice will be to include $x_4$ because 
$r_{y,4} = -0.821$ is the largest correlation in absolute value, as found in the forward selection 
example. It is also the 
variable with the largest F-value, $22.80 > F_{\text{in}} = F_{1,11,0.10} = 3.23$. Hence, the variable $x_4$ 
enters the model. The model is now 

$$\hat{y} = 117.568 - 0.7382x_4,$$

and the overall F-statistic is 22.80 with $R^2 = 0.675$. At the second step, we calculate 
the three F-values: $F_{1|4} = 108.22$, $F_{2|4} = 0.17$, and $F_{3|4} = 40.29$. Since $F_{1|4} > F_{\text{in}} = 
F_{1,10,0.10} = 3.29$, $x_1$ is added to the model. Then, for possible removal of the variables, 
recall that $F_{\text{out}} = F_{1,10,0.10} = 3.29$ at this stage. If the partial F-value for a previously 
entered variable (here we have $x_4$) is smaller than $F_{\text{out}}$, then the variable would be 
removed. The two partial F-values we need to consider are $F_{1|4} = 108.22$ and $F_{4|1} = 
159.30$. Both are larger than $F_{\text{out}} = F_{1,10,0.10} = 3.29$, so we retain both $x_1$ and $x_4$. 
Therefore, the model becomes 

$$\hat{y} = 103.10 + 1.440x_1 - 0.614x_4,$$

where $R^2 = 0.972$ and the overall F-statistic is 176.63, which is clearly significant at 1%. 

The stepwise regression method now considers the next variable to add. Since $F_{2|1,4} = 
5.03 > F_{3|1,4} = 4.24$, $x_2$ is the candidate, and it is entered because $5.03 > F_{\text{in}} = F_{1,9,0.10} = 3.36$. 
Hence, the model we are considering is 
the one with variables $(x_1, x_2, x_4)$. The model has an overall F-statistic of 166.83 and 
$R^2 = 0.982$. Then, we need to check for possible deletion among the three partial F-values: 

$$F_{1|2,4} = 154.01, \quad F_{2|1,4} = 5.02, \quad F_{4|1,2} = 1.88.$$

As we see, $F_{4|1,2} = 1.88 < F_{\text{out}} = F_{1,9,0.10} = 3.36$, so we delete variable $x_4$. We note 
that $x_2$ cannot be eliminated at this stage; before moving on we must recompute the model 
$\hat{y} = f(x_1, x_2)$. Since no other variables are then to be entered or removed, the stepwise 
regression procedure terminates with a final model that contains the variables $(x_1, x_2)$, 
which is 

$$\hat{y} = 52.5773 + 1.4683x_1 + 0.6623x_2.$$

We present a summary of the steps in the stepwise regression 
procedure in the following Table 8.4, based on the output from MINITAB. 


Table 8.4 Stepwise Regression Method for Hald Data 

Step        Constant   x1      x2      x3   x4       R^2 (C_p)   F_in / F_out & decision 
1           117.57     -       -       -    -0.738   0.675       F_in = F_{1,11,0.10} = 3.23 
(t-value)              -       -       -    -4.77                => x4 entered 
2           103.10     1.44    -       -    -0.614   0.972       F_in = F_{1,10,0.10} = 3.29 
(t-value)              10.40   -       -    -12.62               => x1 entered 
3           71.65      1.45    0.416   -    -0.237   0.982       F_in = F_{1,9,0.10} = 3.36 
(t-value)              12.41   2.24    -    -1.37                => x2 entered 
4           52.58      1.47    0.662   -    -        0.979       F_out = F_{1,9,0.10} = 3.36 
(t-value)              12.10   14.44   -    -        (2.7)       => x4 removed 
                                                                 Stopped 


8.4.5 Selection of Models - An Overview 


Criticism of Selection Procedures 


We have examined several approaches to selecting the best subset in a regression model. 
It is important to note that none of the stepwise procedures discussed above can claim 
any kind of optimality, and these approaches do not necessarily result in the same set of 
variables in the model. Thus the researcher must be aware of the fact that there is often 
no “uniquely superior” or “best” regression model for a set of T regressor variables. Nev- 
ertheless, since all the stepwise procedures terminate with one final model, inexperienced 
analysts or novices may conclude that they have found a model that is optimal in some 
sense or that they must accept the model uncritically. The aim of selection is always to 
maximize the ability to find out all of the “relevant” information that is hidden in the 
data. Therefore, several approaches can be utilized or combined in appropriate ways in 
attempting to find a “best” regression model, such as two-stage selection procedures. 

Further criticisms can be found in Gorman and Toman [44], Mantel [83], and Hocking 
[57]. 


Choice of Stopping Rules 


For the choice of stopping rules, $F_{\text{in}}$ or $F_{\text{out}}$, it should be recognized that in the stepwise 
procedures the choice of a small $\alpha$ will limit the number of models that are explored. 
To avoid an early termination of the selection procedure, it is better to choose $\alpha$ larger 
than the typical values 0.01 or 0.05, say 0.20 or 0.25 (but bear in mind that the size of the 
Type I error becomes larger!). Similarly, a much larger value of $\alpha$ could also be used for 
$F_{\text{out}}$ in order to increase the number of models explored in the algorithm. 


Further Comments in Selection Procedures 


If the number of variables T is not too large, say about a dozen, either all possible 
regressions or backward elimination may be preferable in terms of computation. Forward 
selection can be seriously misleading because of the restriction of adding one variable at 
a time. It may also be using badly inflated estimates of $\sigma^2$, whereas a watch can be kept 
on this with backward elimination. 

If multicollinearity is known to be present then ridge regression, or some other biased 
estimation procedure, can be used as a method of selecting variables. (See the next 
chapter.) 

Consequently, the selection procedure leads to two separate testing problems: one 
is the problem of whether the estimated regression coefficients are significantly different 
from zero and the other is the problem of testing the difference between models. Forsythe, 
Engelman, Jennrich and May [35] proposed a permutation test for the first problem, 
and the second problem is a multiple comparison problem of the Scheffé type [101]. 
See Spjøtvoll [107, 108] for more details. Unfortunately, none of these authors deals with 
selection bias in the estimators. 

Once a best model has been found, a thorough residual analysis should be performed 
to evaluate the aptness of the model. In conclusion, although the various calculation ap- 
proaches provide guidelines for variable selection, the model ultimately developed should 
take into consideration such factors as its simplicity, interpretability/predictive ability, 
and the usefulness of the variables. 


8.5 Exercises 


8.1 Gorman and Toman [44] discussed an experiment in which the rate of rutting was 
measured on thirty-one experimental asphalt pavements. Five predictor variables 
were used to specify the conditions under which each asphalt was prepared, while a 
sixth dummy variable was used to express the difference between the two separate 
blocks of runs into which the experiment was divided. 


The multiple regression model used to fit the data was: 

$$Y = \beta_0 + \sum_{j=1}^{6} \beta_j x_j + \varepsilon \qquad (8.39)$$

where 
Y = log(change of rut depth in inches per million wheel passes), 
x1 = log(viscosity of asphalt), 
x2 = per cent asphalt in surface course, 
x3 = per cent asphalt in base course, 
x4 = dummy variable to separate two sets of runs, 
x5 = per cent fines in surface course, and 
x6 = per cent voids in surface course. 


You may assume that Equation (8.39) is “complete” in the sense that it includes all 
the relevant terms. Your assignment is to select a suitable subset of these terms as the 
“best” regression equation in the circumstances. 

Using Table 8.5, answer the following questions and show the steps and information 
that led to your answer. 


(a) What is $R^2$ for the model with variables 1 and 2 in it? Then, what is $s^2$? 


(b) Calculate the $C_p$ statistic value for the model with variables 1, 2, 3, and 4 in it. Give 
a comment. 


(c) Select a suitable subset of independent variables as the “best” regression equation 
if the backward elimination procedure is used. Take $\alpha = 0.1$. 


Table 8.5 Residual Sums of Squares (RSS) for All Possible Models 


Model* RSS Model* RSS Model* RSS 


* Variable numbers included in the regression model with a $\beta_0$ term. 


8.2 Using Table 8.5, explain the forward selection procedure in selecting a suitable 
subset of independent variables as the best regression model. Take $\alpha = 0.05$. 


8.3 Using Table 8.5, select a suitable subset of independent variables as the “best” 
regression equation if the stepwise regression procedure is used. The $\alpha$ values are 0.5 
for entry (or in) and 0.1 for removal (or out). 


[Hint: The residual sum of squares for the regression containing no predictors but only 
$\beta_0$ will give you the corrected total sum of squares (TSS).] 


8.4 (Hocking [56]) Show that $C_p < p$ if and only if $F < 1$, where $F$ is the F-statistic 
for testing the hypothesis that $m + 1 - p$ of the regression coefficients $\beta_j$ are zero. 


8.5 Show that for $m = 1$, the diagonal term of the HAT matrix is 

$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n} (x_j - \bar{x})^2}.$$

[Hint: The diagonals are measures of standardized squared distance.] 


8.6 (Gorman and Toman (1966, p. 50) [44]) Suppose that we wish to omit the regressor 
$x_j$ from a multiple regression model with $p$ parameters. If $F_j$ is the F-statistic for 
testing $H_0: \beta_j = 0$, show that 


Cp-1 = 

Chapter 9 


Multicollinearity: Diagnosis 
and Remedies 


9.1 Introduction 


As we pointed out in Chapter 5, one of the more difficult problems that occurs in doing 
a regression analysis is that of dealing with multicollinearity. In Example 5.17 we 
illustrated some of the consequences of this phenomenon, i.e., the difficulty in interpreting 
the apparently contradictory facts of a good overall fit, as indicated by a large value of 
$R^2$ and a significant observed F ratio, simultaneously with insignificant values for the 
individual t statistics, while in [76] it was shown how multicollinearity could seriously 
affect the accuracy of regression calculations. In addition, some of the problems with 
variable selection techniques that we observed in Chapter 8 are often associated with strong 
multicollinearity in the data. 

In this Chapter we will examine this problem in greater detail. In particular, we will 
discuss methods for detecting multicollinearity, elaborate on its statistical consequences 
and examine some of the proposed remedies, particularly the method of ridge regression. 
This latter technique, which has spawned volumes of research in the past 25 years [28, 
58, 59], is still controversial, as are other forms of biased estimation [8, 116], and is by no 
means espoused by all statisticians, as the work of Draper and Van Nostrand [28] and 
Draper and Smith [27] shows. However, because of its prominent role in current research 
and applications, no modern treatment of regression analysis would be complete without 
some discussion of its use. At present, it is fair to say that many of the computational 
problems associated with multicollinearity have been overcome through the development 
of more sophisticated computational techniques [5, 112], such as the QR decomposition, 
and the advent of powerful computers. However, the problem of building and interpreting 
models in the presence of multicollinearity is far from being totally resolved [116], and 
perhaps never will be. 

9.2 Detecting Multicollinearity 


As we indicated in Chapter 5, multicollinearity occurs in a regression problem if the 
columns of the design matrix are “approximately” linearly dependent. Clearly, this 
is a subjective notion, since the notion of “approximate” may vary from problem to 
problem and, more importantly, from analyst to analyst. Notwithstanding this vagueness, 
considerable effort has been devoted to at least partially quantifying this notion. 

As pointed out in Belsley, Kuh and Welsch [8] (hereafter BKW), historically many of the 
attempts to do this have been seriously flawed (some of these procedures will be discussed shortly). 
As they note, this problem is not unique to regression analysis; it is a classical problem 
in numerical analysis associated with solving any set of linear equations, namely that of 
ill-conditioning. In numerical analysis this problem has been discussed in great detail, 
and they recommend the use of standard numerical techniques based on the SVD for 
diagnosing the problem. We shall, for the most part, follow their recommendations. 

Before we begin our analysis it is important to decide which form of the design matrix 
to use in defining and detecting multicollinearity, since this issue has been a matter of 
some controversy itself. If there is an intercept in the model then five possibilities have 
been suggested: 


(i) Use the original design matrix 

$$X_1 = [\,\mathbf{1} \mid X^*\,] \qquad (9.1)$$

where $X^*$ is the $n \times m$ matrix of regressor variables and $\mathbf{1}$ is a column of ones. 


(ii) Use 

$$X_2 = [\,\mathbf{1} \mid X^*_c\,] \qquad (9.2)$$

where $X^*_c$ is the matrix of centered regressor variables. 


(iii) Use 

$$X_3 = [\,\mathbf{1} \mid X^*_{cs}\,] \qquad (9.3)$$

where $X^*_{cs}$ is the matrix of centered and scaled (as in Example 5.9) regressor variables. 


(iv) Use 

$$X_4 = [\,\mathbf{1}/\sqrt{n} \mid X^*_s\,] \qquad (9.4)$$

where $X^*_s$ is the matrix of scaled regressor variables obtained by dividing each 
element of the i-th column by its Euclidean length. 


(v) Use 

$$X_5 = X^*_{cs}. \qquad (9.5)$$


BKW recommend using $X_4$ because its columns are scaled but not centered, which enables 
one to detect approximate dependencies involving the intercept. On the other hand, Draper 
and Smith [27] and Montgomery and Peck [87] favor the use of $X_5$. 


Since our focus in this Chapter will be more on remedies than on identifying 
specific collinearities, we will perform all diagnostic procedures on $X^*_{cs}$. This is 
consistent with the recommended approach to ridge regression favored by Draper and 
Van Nostrand [28]. Consequently, throughout the remainder of this chapter we will 
assume that the linear model is of the form 

$$Y = X^*_{cs}\beta_{cs} + \varepsilon \qquad (9.6)$$

where $Y$ is also centered and scaled and the errors $\varepsilon_i$, $1 \le i \le n$, are independent 
$N(0, \sigma^2)$. The transition between the least squares (and other) estimators of the coefficients 
in (9.6) and those in the original model $Y = \mathbf{1}\beta_0 + X^*\beta_* + \varepsilon$ will be made in accordance 
with the prescription given in Example 5.9: when $y$ is also scaled and centered, the two 
coefficient vectors are related through a scaling matrix. To avoid excessive notational 
complexity we will generally just write $X$ for $X^*_{cs}$ and $\beta$ for $\beta_{cs}$. 


Eigenvalue Criteria 


Recall that in Chapter 5 we said that $X$ exhibits multicollinearity if there exists a vector 
$c = (c_1, \ldots, c_m)^T$ of length one ($\sum_{i=1}^{m} c_i^2 = \|c\|^2 = 1$) such that 

$$\sum_{i=1}^{m} c_i x_i = \varepsilon \qquad (9.7)$$

where $\varepsilon$ is a “small” vector in the sense that $\|\varepsilon\|$ is small. 

The size of $\|\varepsilon\|$ indicates the degree of the approximate linear dependence among 
the columns of $X$. Of course $c$ is not unique, with the relative sizes of the components 
in different $c$'s indicating, perhaps, qualitatively different dependencies. Large (relative 
to one) values of $c_i$ indicate which variables are primarily involved in a particular 
approximate collinearity. 

Since (9.7) is not a computationally convenient criterion to work with, many supposedly 
equivalent definitions of multicollinearity have been given [8]. However, as indicated 
in BKW, many of these conditions are either necessary or sufficient for (9.7) to hold, 
but not both. On the other hand, examination of the eigenvalues of $X^TX$ does provide a 
condition that is both necessary and sufficient. 

To show this we consider the SVD of $X$ as described in Section 4.9. If $u_i$, $1 \le i \le m$, 
are the right singular vectors of $X$, then 

$$Xu_i = \mu_i v_i \qquad (9.8)$$

where $v_i$ is the i-th left singular vector of $X$ and $\mu_i$ is the i-th singular value of $X$. 
Now if $\mu_i$ is small, then $\|\mu_i v_i\| = \mu_i \|v_i\| = \mu_i$ is small, since $\|v_i\| = 1$. Letting $\varepsilon = \mu_i v_i$ we find 
that 

$$Xu_i = \varepsilon. \qquad (9.9)$$

Letting $c_j = (u_i)_j$, the j-th component of $u_i$, (9.9) becomes 

$$\sum_{j=1}^{m} c_j x_j = \varepsilon \qquad (9.10)$$

where $\|c\| = 1$ because $u_i$ is the i-th normalized eigenvector of $X^TX$. Since $\mu_i = 
\sqrt{\lambda_i}$, where $\lambda_i$ is the i-th eigenvalue of $X^TX$, our basic multicollinearity diagnostic is to 
examine the eigenvalues of $X^TX$. Eigenvalues near zero indicate near dependencies in 
the data. In fact, (9.10) gives more: it tells us quantitatively what these dependencies 
are. 

Conversely, if (9.10) holds then $X^TX$ has a small eigenvalue. For this we observe 
from linear algebra that if $\lambda_{\min}$ is the smallest eigenvalue of $X^TX$, then 

$$\lambda_{\min} = \min_{\|d\|=1} (d, X^TXd) = \min_{\|d\|=1} (Xd, Xd) \le (Xc, Xc) = \|Xc\|^2 = \|\varepsilon\|^2 \qquad (9.11)$$

where $c$ is given by (9.7). Thus, 

$$\lambda_{\min} \le \|\varepsilon\|^2 \qquad (9.12)$$

so that if $\|\varepsilon\|$ is small, then $X^TX$ has at least one small eigenvalue, $\lambda_{\min}$. 

This analysis indicates that a multicollinearity exists if and only if $X^TX$ has at least 
one small eigenvalue. These can be found by computing the spectral decomposition of 
$X^TX$. However, for numerical accuracy and stability it is generally preferable to compute 
the SVD of $X$ by an algorithm such as that given in [112]. 

Summarizing, multicollinearity exists if and only if $X^TX$ has at least one small 
eigenvalue $\lambda$. For each small eigenvalue a dependency is obtained by 

$$\sum_{i=1}^{m} c_i x_i \approx 0 \qquad (9.13)$$

where $c = (c_1, c_2, \ldots, c_m)^T$ is a normalized eigenvector of $X^TX$ corresponding to $\lambda$. 
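
The equivalence between a small eigenvalue and a small $\|\varepsilon\|$ can be checked on a toy case. The sketch below is a hypothetical two-column example (not data from the text): two unit-length columns separated by a small angle `theta`, for which the eigenvalues of $X^TX$ are known in closed form.

```python
import math

theta = 0.01                       # small angle between two unit-length columns
x1 = [1.0, 0.0]
x2 = [math.cos(theta), math.sin(theta)]

# X^T X for X = [x1 x2] is [[1, cos(theta)], [cos(theta), 1]], whose
# eigenvalues are 1 + cos(theta) and 1 - cos(theta).
lam_min = 1.0 - math.cos(theta)
lam_max = 1.0 + math.cos(theta)

# Normalized eigenvector belonging to lam_min.
c = [1.0 / math.sqrt(2.0), -1.0 / math.sqrt(2.0)]

# eps = c1*x1 + c2*x2 is the "small" vector of (9.7).
eps = [c[0] * a + c[1] * b for a, b in zip(x1, x2)]
eps_norm_sq = sum(e * e for e in eps)

print(lam_min, eps_norm_sq)        # equal: ||Xc||^2 attains the bound in (9.12)
print(lam_max / lam_min)           # condition number of X^T X
```

Here the eigenvector $c \propto (1, -1)$ recovers the near dependency $x_1 - x_2 \approx 0$, in the sense of (9.13).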

The question now arises as to how small an eigenvalue needs to be in order for a 
corresponding approximate linear dependency to exist. Again, this is at least a partly 
subjective matter, but some guidelines can be given from numerical analytical 
considerations in solving the normal equations $X^TX\hat{\beta} = X^Ty$. 

As indicated in Section 4.9 the condition number, $\kappa(X^TX) = \mu_{\max}/\mu_{\min}$, governs the 
numerical stability of solving the normal equations. Since the singular values of $X^TX$ 
are the square roots of the eigenvalues of $(X^TX)^T X^TX = X^TXX^TX = (X^TX)^2$, they 
coincide with the eigenvalues of $X^TX$, which are the squares of the singular values of $X$. Hence 

$$\kappa(X^TX) = \frac{\lambda_{\max}}{\lambda_{\min}} \qquad (9.14)$$

where $\lambda_{\max}$ and $\lambda_{\min}$ are the maximum and minimum eigenvalues of $X^TX$. 

BKW [8] suggest that much real data may be known at best to four significant 
figures. (Since their interest was primarily in economics problems, many of their 
observations, and thus their recommendations, are drawn from an intimate knowledge of that 
type of data and may not be universally valid [8].) If one wishes to preserve at least one 
significant figure in the solution, then one should tolerate condition numbers no larger 
than 1000. As a consequence, they advise that if the condition number exceeds 1000, 
then a multicollinearity problem may be present. This of course is equivalent to taking 
$\lambda_{\max}/\lambda_{\min} > 1000$, or $\lambda_{\min} < \lambda_{\max}/1000$, as an indication of how small $\lambda_{\min}$ must be for 
$X^TX$ to be ill-conditioned. 

If we define the condition number of $X$ as 

$$\kappa(X) = \sqrt{\kappa(X^TX)} \qquad (9.15)$$

then a condition number of $X$ greater than 30 is considered to be potentially damaging 
to the regression analysis. As a rough guideline, all eigenvalues $\lambda$ of $X^TX$ such that 
$\lambda_{\max}/\lambda > 1000$ indicate that a corresponding near dependency exists in the data. 

Example 9.1 We consider the drink delivery data in Example 5.14 to check 
for a possible multicollinearity problem in the data. First, we find the sample 
correlation matrix $R$ of $(y, x_1, x_2)$: 

$$R = \begin{bmatrix} 1.0000 & 0.9646 & 0.8917 \\ 0.9646 & 1.0000 & 0.8242 \\ 0.8917 & 0.8242 & 1.0000 \end{bmatrix}$$

The standardized form of $X^TX$ is calculated as 

$$X^TX = \begin{bmatrix} 1.0000 & 0.7924 & 0.7891 \\ 0.7924 & 1.0000 & 0.9341 \\ 0.7891 & 0.9341 & 1.0000 \end{bmatrix}$$

Then, setting $|X^TX - \lambda I| = 0$ we obtain the eigenvalues 

$$\lambda_1 = 2.6790, \quad \lambda_2 = 0.2552, \quad \lambda_3 = 0.06589.$$

Hence, the condition number is 

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}} = \frac{2.6790}{0.06589} \approx 40.66.$$

Since this is far below 1000, we conclude that there is no serious multicollinearity problem in these data. 
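
The eigenvalues quoted in Example 9.1 can be checked numerically from the standardized $X^TX$ given there. The following is a minimal sketch using cyclic Jacobi rotations in plain Python; the function name `jacobi_eigenvalues` is illustrative, and in practice a library routine for symmetric eigenproblems or the SVD would normally be used instead.

```python
import math


def jacobi_eigenvalues(A, sweeps=50, tol=1e-20):
    """Eigenvalues of a symmetric matrix by cyclic Jacobi rotations."""
    n = len(A)
    A = [row[:] for row in A]
    for _ in range(sweeps):
        off = sum(A[i][j] ** 2 for i in range(n) for j in range(n) if i != j)
        if off < tol:
            break
        for p in range(n - 1):
            for q in range(p + 1, n):
                if abs(A[p][q]) < 1e-300:
                    continue
                # Rotation angle chosen so that the (p, q) entry becomes zero.
                theta = 0.5 * math.atan2(2.0 * A[p][q], A[q][q] - A[p][p])
                c, s = math.cos(theta), math.sin(theta)
                for k in range(n):          # A <- A J (columns p and q)
                    akp, akq = A[k][p], A[k][q]
                    A[k][p] = c * akp - s * akq
                    A[k][q] = s * akp + c * akq
                for k in range(n):          # A <- J^T A (rows p and q)
                    apk, aqk = A[p][k], A[q][k]
                    A[p][k] = c * apk - s * aqk
                    A[q][k] = s * apk + c * aqk
    return sorted((A[i][i] for i in range(n)), reverse=True)


# Standardized X^T X from Example 9.1.
XTX = [
    [1.0000, 0.7924, 0.7891],
    [0.7924, 1.0000, 0.9341],
    [0.7891, 0.9341, 1.0000],
]

eigenvalues = jacobi_eigenvalues(XTX)
kappa = eigenvalues[0] / eigenvalues[-1]      # condition number of X^T X, (9.14)
kappa_X = math.sqrt(kappa)                    # condition number of X, (9.15)

print([round(lam, 4) for lam in eigenvalues])
print(round(kappa, 2), round(kappa_X, 2))
```

Both guidelines agree here: the condition number of $X^TX$ is well under 1000 and that of $X$ is well under 30.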


Example 9.2 Using the Hald cement data in Example 5.16, we check for a 
potential multicollinearity problem in the data. First, we find the sample correlation 
matrix $R$ of $(x_1, x_2, x_3, x_4, y)$: 

$$R = \begin{bmatrix} 1.0000 & 0.2286 & -0.8241 & -0.2455 & 0.7307 \\ 0.2286 & 1.0000 & -0.1392 & -0.9730 & 0.8163 \\ -0.8241 & -0.1392 & 1.0000 & 0.0295 & -0.5347 \\ -0.2455 & -0.9730 & 0.0295 & 1.0000 & -0.8213 \\ 0.7307 & 0.8163 & -0.5347 & -0.8213 & 1.0000 \end{bmatrix}$$

From this we observe that the pairs $(x_1, x_3)$ and $(x_2, x_4)$ have a strong linear dependency 
($r = -0.8241$ and $r = -0.9730$, respectively). 
From the original $X^TX$, we have the normalized form of $X^TX$ (upper triangle shown), which is 

$$X^TX = \begin{bmatrix} 1.0000 & 0.79715 & 0.95496 & 0.88617 & 0.88136 \\ & 1.0000 & 0.80216 & 0.47584 & 0.63255 \\ & & 1.0000 & 0.82713 & 0.70537 \\ & & & 1.0000 & 0.78750 \\ & & & & 1.0000 \end{bmatrix}$$

The five eigenvalues for the Hald cement data are 

$$\lambda_1 = 4.1196, \quad \lambda_2 = 0.5539, \quad \lambda_3 = 0.2887, \quad \lambda_4 = 0.03768, \quad \lambda_5 = 0.00009.$$

Hence, the condition number is 

$$\kappa = \frac{\lambda_{\max}}{\lambda_{\min}} = \frac{4.11960}{0.00009} \approx 45{,}773.33,$$

which implies that there exists a serious multicollinearity problem in these data. 
For further analysis of the data, we also present the matrix of corresponding eigenvectors 
$e_i$: 

$$E = [e_1, e_2, e_3, e_4, e_5] = \begin{bmatrix} -0.49207 & -0.02282 & 0.03065 & 0.22485 & -0.84015 \\ -0.40060 & 0.76625 & -0.13253 & -0.47775 & 0.08112 \\ -0.46744 & 0.14241 & 0.51703 & 0.55101 & 0.43623 \\ -0.43481 & -0.56697 & 0.31671 & -0.61265 & 0.11765 \\ -0.43570 & -0.26572 & -0.78350 & 0.20557 & 0.28884 \end{bmatrix}$$

From our previous analysis the eigenvectors can be used to obtain the approximate 
collinearities that exist in the data. We first calculate the condition indices 

$$\kappa_i = \lambda_{\max}/\lambda_i, \quad 1 \le i \le 5. \qquad (9.16)$$

Those eigenvectors for which $\kappa_i > 1{,}000$ then indicate the approximate collinearities. 
In fact, if $e_i = (e_{i1}, e_{i2}, \ldots, e_{i5})$ is an eigenvector for which $\kappa_i > 1{,}000$, then an 
approximate collinearity is given by 

$$\sum_{j=1}^{5} e_{ij} x_j = 0. \qquad (9.17)$$

Here 

$$\kappa_1 = 1, \quad \kappa_2 = \frac{4.1196}{0.5539} = 7.437, \quad \kappa_3 = \frac{4.1196}{0.2887} = 14.269, \quad \kappa_4 = \frac{4.1196}{0.03768} = 109.33, \quad \kappa_5 = \frac{4.1196}{0.00009} = 45{,}773.33.$$

Hence, $\kappa_5$ is the only condition index which exceeds 1,000. Thus an approximate 
collinearity 

$$-0.84015 + 0.08112x_1 + 0.43623x_2 + 0.11765x_3 + 0.28884x_4 = 0 \qquad (9.18)$$

exists in the data. Dropping the term in $x_1$, whose coefficient is small, we obtain the approximate collinearity 

$$-0.84015 + 0.43623x_2 + 0.11765x_3 + 0.28884x_4 = 0. \qquad (9.19)$$

This suggests that $x_1$ should be included in the model, but at least one of $(x_2, x_3, x_4)$ is 
superfluous. This is consistent with our variable selection results in Chapters 5 and 8. 
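
The condition indices of (9.16) are simple ratios, so the screening step in Example 9.2 is easy to script. A minimal sketch using the eigenvalues and the fifth eigenvector quoted above (the variable names are illustrative):

```python
# Eigenvalues of the normalized X^T X for the Hald data, from Example 9.2.
eigenvalues = [4.1196, 0.5539, 0.2887, 0.03768, 0.00009]

# Fifth eigenvector (last column of E); components correspond to the
# intercept column and x1, ..., x4.
e5 = [-0.84015, 0.08112, 0.43623, 0.11765, 0.28884]
labels = ["1", "x1", "x2", "x3", "x4"]

lam_max = max(eigenvalues)
indices = [lam_max / lam for lam in eigenvalues]      # kappa_i in (9.16)

# Flag eigenvalues whose condition index exceeds the guideline of 1000.
flagged = [i + 1 for i, k in enumerate(indices) if k > 1000]

print([round(k, 3) for k in indices])
print(flagged)

# Write out the approximate collinearity (9.18) carried by the flagged eigenvector.
terms = " + ".join(f"({coef:+.5f})*{name}" for coef, name in zip(e5, labels))
print(terms, "= 0 (approximately)")
```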


9.3 Other Multicollinearity Diagnostics 


Although it is now generally recognized that the eigenvalue structure of $X^TX$ is the most 
reliable way of detecting multicollinearity, historically many other methods have been 
proposed and a number are in common use today, even though they may be defective 
in one or more ways. In particular, because the eigenvalues of $X^TX$ do not have a 
simple statistical interpretation and may themselves be difficult to compute, these other 
measures can often be useful adjuncts to, if not a replacement for, the eigenvalue analysis of 
$X^TX$. A number of things that one frequently looks for are the following: 

(i) Large correlations between the regressor variables 


Typically, correlations exceeding 0.9 indicate the possible existence of multicollinearity 
problems. Intuitively, a correlation of this size between two columns of $X$ suggests that 
these columns are approximately linearly related. However, a high correlation between 
two variables may exist because these variables are implicated in a near dependency 
with other variables [8], so even though large correlations indicate the presence of 
multicollinearity, they do not necessarily indicate the full nature of the dependencies. 

On the other hand, strong (even exact) dependencies may exist in $X$ with all 
the pairwise correlations being small. Thus examining the correlation matrix ($X^TX$) 
of the original regressors is not a totally satisfactory way of detecting multicollinearity. 

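As an example of this latter phenomenon, suppose we have $m - 1$ independent random 
variables $X_1, X_2, \ldots, X_{m-1}$ such that $E(X_i) = 0$ and $\mathrm{Var}(X_i) = 1$, $1 \le i \le m - 1$. 
Let $X_m = \sum_{j=1}^{m-1} X_j$; then $\mathrm{Cov}(X_i, X_j) = 0$, $i \ne j$, $1 \le i, j \le m - 1$, 
and $\mathrm{Cov}(X_i, X_m) = \mathrm{Var}(X_i) = 1$. Since $\sigma(X_m) = \sqrt{m - 1}$, the correlation matrix $\rho$ of 
$X_1, X_2, \ldots, X_m$ is given by 

$$\rho = \begin{bmatrix} 1 & 0 & \cdots & 0 & \frac{1}{\sqrt{m-1}} \\ 0 & 1 & \cdots & 0 & \frac{1}{\sqrt{m-1}} \\ \vdots & & \ddots & & \vdots \\ 0 & 0 & \cdots & 1 & \frac{1}{\sqrt{m-1}} \\ \frac{1}{\sqrt{m-1}} & \frac{1}{\sqrt{m-1}} & \cdots & \frac{1}{\sqrt{m-1}} & 1 \end{bmatrix} \qquad (9.20)$$

Obviously if $m$ is large, all the correlations will be small, yet $\rho$ is singular due to the 
linear relation $X_m = \sum_{j=1}^{m-1} X_j$. 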
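
The singularity of $\rho$ in (9.20) can be verified directly: the vector $c = (1, \ldots, 1, -\sqrt{m-1})^T$ satisfies $\rho c = 0$. A minimal sketch, with $m = 101$ chosen arbitrarily for illustration:

```python
import math


def correlation_matrix(m):
    """Correlation matrix (9.20) of X_1, ..., X_{m-1} and X_m = X_1 + ... + X_{m-1}."""
    r = 1.0 / math.sqrt(m - 1)
    rho = [[1.0 if i == j else 0.0 for j in range(m)] for i in range(m)]
    for i in range(m - 1):
        rho[i][m - 1] = rho[m - 1][i] = r
    return rho


m = 101                                   # hypothetical number of variables
rho = correlation_matrix(m)

# Largest pairwise correlation: only 0.1 when m = 101.
largest = max(abs(rho[i][j]) for i in range(m) for j in range(m) if i != j)

# The vector c = (1, ..., 1, -sqrt(m - 1)) lies in the null space of rho.
c = [1.0] * (m - 1) + [-math.sqrt(m - 1)]
residual = [sum(rho[i][j] * c[j] for j in range(m)) for i in range(m)]

print(largest)
print(max(abs(r) for r in residual))      # zero up to rounding error
```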


(ii) Large values of the variance inflation factors 


A typical guideline, as discussed in Chapter 5, is to regard any VIF > 10 as possibly 
damaging to the analysis. This procedure is certainly sensible in view of the expression of 
the VIF in terms of the reciprocals of the eigenvalues of $X^TX$. In addition, the relation 

$$\mathrm{VIF}_i = \frac{1}{1 - R_i^2} \qquad (9.21)$$

where $R_i^2$ is the squared multiple correlation of the regression of $x_i$ on $x_j$, $j = 1, 2, \ldots, m$, $j \ne i$, 
shows that a large value of any $\mathrm{VIF}_i$ corresponds to a value of $R_i^2$ close to one. This, in 
turn, indicates a possible approximate linear relation between the regressors. (Note that a 
value $\mathrm{VIF}_i > 10$ corresponds to $R_i^2 > 0.9$.) However, since we have already noted that it is 
not always easy to relate the “strength” of a possible linear relationship to the size of $R^2$, 
BKW find some fault with this indicator because of its lack of precision. On the other 
hand, VIFs have an easily understood statistical interpretation and frequently provide a 
reliable indicator of the presence of one or more collinearities. Since they are routinely 
computed when the regression is done in correlation form, they are readily available and 
should be examined in any regression analysis. In particular, if one is using a package 
where eigenvalue analysis is not available, they make a convenient, and usually reliable, 
substitute. 
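
The correspondence between the VIF > 10 guideline and $R_i^2 > 0.9$ follows directly from (9.21). A small sketch (the function names are illustrative):

```python
def vif_from_r2(r2):
    """Variance inflation factor from the squared multiple correlation, (9.21)."""
    return 1.0 / (1.0 - r2)


def r2_from_vif(vif):
    """Inverse relation: the R_i^2 implied by a given VIF."""
    return 1.0 - 1.0 / vif


for r2 in (0.0, 0.5, 0.9, 0.99):
    print(r2, round(vif_from_r2(r2), 2))

print(r2_from_vif(10.0))    # the R_i^2 at the VIF = 10 guideline
```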

(iii) Small values of $\det(X^TX)$ 


This is a useful and again readily available diagnostic, but suffers from the deficiency 
of not being able to delineate the nature of the near dependencies. However, if this is 
not of interest then examining $\det(X^TX)$ can be a reasonable alternative to a complete 
eigenvalue analysis. 


(iv) Large estimated regression coefficients 


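The rationale for this diagnostic stems from the following observation. Consider
$E\langle\hat\beta, \hat\beta\rangle$, the expected squared length of $\hat\beta$. Then,

$$E\langle\hat\beta, \hat\beta\rangle = \sum_{j=1}^{m} E(\hat\beta_j^2) = \sum_{j=1}^{m}\left[(E\hat\beta_j)^2 + \mathrm{Var}(\hat\beta_j)\right] = \langle\beta, \beta\rangle + \sigma^2\,\mathrm{tr}\,(X^T X)^{-1}$$

which gives

$$E\langle\hat\beta, \hat\beta\rangle = \langle\beta, \beta\rangle + \sigma^2 \sum_{j=1}^{m} \frac{1}{\lambda_j} \tag{9.24}$$

where $\lambda_j$, $1 \le j \le m$, are the eigenvalues of $X^T X$.

Thus, if $X^T X$ has a small eigenvalue, $\langle\hat\beta, \hat\beta\rangle$ on average will be much larger than the
true value $\langle\beta, \beta\rangle$. Thus the presence of multicollinearity tends to inflate the estimated
length of $\hat\beta$, which suggests that at least one of $|\hat\beta_j|$, $1 \le j \le m$, will be large.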
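
To see how strongly a single small eigenvalue inflates the expected squared length in (9.24), the following sketch evaluates the formula for two hypothetical eigenvalue sets. The numbers and the function name `expected_squared_length` are illustrative assumptions, not values from the text.

```python
def expected_squared_length(beta, sigma2, eigenvalues):
    """E<beta_hat, beta_hat> = <beta, beta> + sigma^2 * sum(1 / lambda_j), as in (9.24)."""
    true_squared_length = sum(b * b for b in beta)
    inflation = sigma2 * sum(1.0 / lam for lam in eigenvalues)
    return true_squared_length + inflation


beta = [1.0, 1.0]  # hypothetical true coefficients, squared length 2
well_conditioned = expected_squared_length(beta, 1.0, [1.0, 1.0])
ill_conditioned = expected_squared_length(beta, 1.0, [1.99, 0.01])

print(f"orthogonal regressors:  {well_conditioned:.3f}")
print(f"one small eigenvalue:   {ill_conditioned:.3f}")
```

In the second case almost all of the expected squared length comes from the term $\sigma^2/\lambda_{\min}$ rather than from $\langle\beta,\beta\rangle$.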


(v) Large standard errors of $\hat\beta_j$


As noted in Chapter 5,

$$\mathrm{Var}(\hat\beta_j) = \sigma^2 \sum_{i=1}^{m} \frac{q_{ji}^2}{\lambda_i} \tag{9.25}$$

so that a small eigenvalue $\lambda_i$ of $X^T X$ may lead to a large value of this quantity. This
generally indicates that the estimation of $\beta_j$ is unstable and that considerably different
values would be obtained if the data were slightly altered. Statistically, this leads to
wide confidence intervals for $\beta_j$ and a correspondingly insignificant observed $t$ value. Since
the presence of just one approximate dependency may inflate the variances of all of
the coefficient estimates, one may find that many or all of the estimated coefficients
fail to appear significantly different from zero even if the overall fit is excellent. This
phenomenon was seen in the Hald data in Example 5.16 and the Longley data in Example
5.17.

(vi) One or more estimated regression coefficients of the wrong sign


Often in practice a theory and/or common sense suggests what the sign of a regression 
coefficient ought to be. For example, if one was building a salary model for a particular 
population one would anticipate that “years of service” would be a significant variable 
with a positive coefficient. Estimation of this coefficient from a particular set of data 
leading to a negative coefficient would surely cause one to become concerned. However, 
if the model also had “age” as a variable then multicollinearity due to these variables 
might be anticipated leading to possible large standard errors of the coefficient for “years 
of service”. The resultant instability in estimation could then lead to the wrong sign for 
this coefficient. 

A detailed discussion of this phenomenon may be found in BKW [8]. On the other
hand, a wrong sign may be due to bias resulting from misspecification of the model or
to large values of $\sigma^2$, rather than to problems in the data.

In summary, it is wise to keep in mind that multicollinearity is a problem with the
design matrix, and one should diagnose its presence using measures which depend only
on $X$ and not on quantities which involve the dependent variable $Y$. In
examining the sizes, signs and $t$ values of $\hat\beta_j$, $1 \le j \le m$, one is really looking more at the
consequences of multicollinearity than determining its presence. In this regard,
the eigenvalues of $X^T X$, the variance inflation factors and $\det(X^T X)$ are useful for this
purpose, while $\hat\beta$ and various calculated statistical quantities may not be.


9.3.1 Consequences of Multicollinearity 


We have already discussed the consequences of multicollinearity on the least squares
estimation of $\beta$ in a number of places, so here we merely summarize our previous
observations. First, multicollinearity makes $X^T X$ highly ill-conditioned, and this can
lead to large round-off errors in the numerical calculation of $\hat\beta$ for any method that uses
the normal equations to compute $\hat\beta$. To a large extent, this problem can be alleviated
by using techniques such as the QR decomposition of $X$, which avoids forming $X^T X$. In
addition, with the high precision of today's computers, this problem is less serious
than in the past.

Second, estimation of at least some components of $\beta$ may be unstable due to the
presence of multicollinearity. (This instability may often be demonstrated by observing
substantial changes in the coefficient estimates if a variable or observation is deleted.)
This may give estimated coefficients with the wrong sign and/or too large a magnitude.
In addition, many variables may have small $t$ values, leading to the apparent
contradiction of having an overall good fit with very few or no significant coefficients.
In this regard, it is of some interest to be able to determine a priori which coefficients
may be estimated poorly, and we digress briefly to discuss a further diagnostic procedure
introduced by BKW for this purpose.


Let

$$\pi_{ij} = \frac{q_{ji}^2/\lambda_i}{\mathrm{VIF}_j}, \qquad j = 1, 2, \ldots, (m+1) \tag{9.26}$$

denote the proportion of the variance of $\hat\beta_j$ due to $\lambda_i$. (This is called the variance
decomposition proportion by BKW.) If for fixed $i$, $\pi_{ij}$ is large for two or more values
of $j$, then $\lambda_i$ is contributing a large proportion of the variance of those coefficients. If
$\lambda_i$ is an eigenvalue responsible for an approximate dependency, then these variables are
implicated in the dependency and that dependency will generally degrade the estimation
of the corresponding $\beta_j$'s. As a guideline BKW suggest that if $\pi_{ij} > 0.5$ then $\pi_{ij}$ should
be considered large for any $\lambda_i$ whose condition index $\mu_{\max}/\mu_i > 30$. (This diagnostic
should produce essentially the same information as an examination of the sizes of the
coefficients of the eigenvector(s) corresponding to $\lambda_i$.)
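
The variance decomposition proportions of (9.26) can be sketched for a small case where the eigenstructure is known in closed form. The example below assumes two standardized regressors with correlation $r = 0.99$, so that $X^T X$ has eigenvalues $1 + r$ and $1 - r$ with eigenvectors $(1, 1)/\sqrt{2}$ and $(1, -1)/\sqrt{2}$. It also assumes $\mathrm{VIF}_j = \sum_i q_{ji}^2/\lambda_i$, which holds in correlation form. The function name and the data are illustrative only.

```python
import math


def variance_decomposition(Q, eigenvalues):
    """Return (pi, vif) where pi[i][j] = (q_ji^2 / lambda_i) / VIF_j, as in (9.26).

    Q[j][i] is the j-th component of the i-th normalized eigenvector of X^T X.
    """
    m = len(eigenvalues)
    # phi[j][i]: contribution of eigenvalue i to the variance of coefficient j
    phi = [[Q[j][i] ** 2 / eigenvalues[i] for i in range(m)] for j in range(m)]
    vif = [sum(row) for row in phi]
    pi = [[phi[j][i] / vif[j] for j in range(m)] for i in range(m)]
    return pi, vif


r = 0.99  # hypothetical correlation between the two regressors
s = 1.0 / math.sqrt(2.0)
Q = [[s, s], [s, -s]]            # columns are the eigenvectors
eigenvalues = [1.0 + r, 1.0 - r]

pi, vif = variance_decomposition(Q, eigenvalues)
for i, lam in enumerate(eigenvalues):
    print(f"lambda = {lam:.2f}: proportions = {[round(p, 3) for p in pi[i]]}")
```

Here the small eigenvalue accounts for more than half of the variance of both coefficients, so both variables are flagged as involved in the near dependency.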

Further difficulties arise in variable selection as has been pointed out in Chapter 8. 


9.3.2 Prediction 


In contrast to estimation, multicollinearity need not cause problems with using the
model for prediction, so long as one predicts at points which are consistent with the
collinearities in the original data. To elaborate, we suppose that the model fits the data
overall and we want to use the model to estimate the response $Y$ at some point $x_0$. As
usual, the precision of this estimate will be measured by $\mathrm{Var}(\hat Y_{x_0})$, which according to
Eq. (5.286) is estimated by

$$\widehat{\mathrm{Var}}(\hat Y_{x_0}) = s^2 \langle x_0, (X^T X)^{-1} x_0\rangle. \tag{9.27}$$

We analyze this further using the spectral decomposition of $(X^T X)^{-1}$. From (4.125),
$(X^T X)^{-1} = Q\Lambda^{-1}Q^T$ so that

$$\langle x_0, (X^T X)^{-1} x_0\rangle = \langle x_0, Q\Lambda^{-1}Q^T x_0\rangle = \langle Q^T x_0, \Lambda^{-1} Q^T x_0\rangle = \langle v, \Lambda^{-1} v\rangle = \sum_{i=1}^{m} \frac{v_i^2}{\lambda_i} \tag{9.28}$$

where $v = (v_1, v_2, \ldots, v_m)^T = Q^T x_0$.

Now suppose for sake of argument that $\lambda_i$ is a small eigenvalue which contributes a
near dependency in $X$. The corresponding component $v_i^2/\lambda_i$ in (9.28) will be large, unless
the effect of $1/\lambda_i$ is counterbalanced by a small value of $v_i$. For this to happen, we
observe that

$$v_i = (Q^T x_0)_i = \sum_{j=1}^{m} q_{ji}\, x_{0j} \tag{9.29}$$

where $Q = [q_{ij}]$.

Now $(q_{ji})$, $1 \le j \le m$, is the $i$-th column of $Q$, which is the $i$-th eigenvector of $X^T X$.
Recalling that this eigenvector defines a collinearity, it follows that if $x_0$ satisfies this
collinearity, then $v_i \approx 0$. Thus, if $x_0$ is consistent with the collinearity determined by
$\lambda_i$, then $v_i^2/\lambda_i$ may be small even if $1/\lambda_i$ is large. Hence, if $x_0$ is consistent with all the
collinearities in the original data, then $\mathrm{Var}(\hat Y_{x_0})$ may still be small in the presence of
small eigenvalues.
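
The argument around (9.28) and (9.29) can be checked numerically on a toy case. The sketch below again assumes two standardized regressors with correlation $r = 0.99$ (so the data satisfy $x_1 \approx x_2$) and compares the factor $\langle x_0, (X^T X)^{-1} x_0\rangle$ at a point that respects the collinearity with one that violates it. All numbers and names are illustrative assumptions.

```python
import math


def prediction_variance_factor(x0, Q, eigenvalues):
    """<x0, (X^T X)^{-1} x0> = sum_i v_i^2 / lambda_i with v = Q^T x0, as in (9.28)."""
    m = len(eigenvalues)
    v = [sum(Q[j][i] * x0[j] for j in range(m)) for i in range(m)]
    return sum(v[i] ** 2 / eigenvalues[i] for i in range(m))


r = 0.99  # hypothetical correlation between the two regressors
s = 1.0 / math.sqrt(2.0)
Q = [[s, s], [s, -s]]            # eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2)
eigenvalues = [1.0 + r, 1.0 - r]

consistent = prediction_variance_factor([1.0, 1.0], Q, eigenvalues)     # x1 = x2
inconsistent = prediction_variance_factor([1.0, -1.0], Q, eigenvalues)  # x1 = -x2

print(f"point consistent with the collinearity:   {consistent:.3f}")
print(f"point inconsistent with the collinearity: {inconsistent:.3f}")
```

For the consistent point the component of $v$ along the small eigenvalue vanishes, so the prediction variance stays modest; for the inconsistent point the whole of $v$ falls on the small eigenvalue and the variance factor is two orders of magnitude larger.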

One can think of this phenomenon in geometric terms as follows. Roughly speaking,
multicollinearities in the data restrict the rows of $X$ (the observed regressor points) to
lie close to a proper subspace of the full $m$-dimensional regressor space. Thus prediction
can be reliable for points in this subspace but generally unreliable for points orthogonal
to this space.

With the above argument in mind, some authors take the view that if severe
multicollinearity is present one should not concern oneself with estimation, because the data
may be non-informative in this respect. The model may be used for prediction, but only
at points consistent with the collinearities in the original data, since the observed data
are unable to provide one with information about the nature of the model outside of
this region. Apparently this is a view held by only a minority, as the recent explosion
of research on "improved" estimation methods seems to indicate [26]. We now turn our
attention to this subject.


9.4 Combatting Multicollinearity 


Suppose that one has concluded that one or more strong multicollinearities are present 
in the data and one is interested in estimation and prediction at points not necessarily 
in line with the observed near dependencies. What, if anything, can one do? Ignoring 
computational problems, the primary effect of multicollinearity will be to produce large 
estimation and/or prediction errors if least squares estimation is used. Since the Gauss-Markov
theorem says that the least squares estimator is the best linear unbiased estimator
(BLUE) of $\beta$, reduction in variance will then be feasible only if we use some nonlinear
estimator or a biased linear estimator. Both of these approaches have been utilized, with 
current statistical thought seeming to favor biased linear estimation. It is this approach 
that we shall discuss in some detail. However, before resorting to such an approach, 
it is worthwhile to examine what remedies may be available in the context of classical 
estimation. 

First, coefficients of variables not involved in any near dependency may not be affected 
by multicollinearity. If one is only interested in these coefficients then no remedy need 
be taken. However, in most cases estimation of all of the coefficients will be of concern 
so that this circumstance is probably mostly of theoretical interest. 

More generally, certain linear combinations of the regression coefficients may be
estimated accurately even if the individual coefficients cannot. Using the model for prediction
is a particular case of this and, as has been pointed out in the previous section, certain
predictions may require no remediation.

As Belsley, Kuh and Welsch (BKW) [8] emphasize, multicollinearity is a problem 
with the design matrix X, and is not a statistical problem, so initial strategies might 
focus on modifying X if at all possible. In this regard the following approaches have 
been suggested. 


(i) Collect new data 


If multicollinearity is present because of the way the data were collected, then collecting
new data may solve the problem. For instance, collecting data in a narrow range of any
regressor variable $x$ may introduce an approximate linear dependence with the intercept.
As a particular example, in economics, regression models are often used to model "total
consumption" in a given year as a function of the previous year's consumption, "total income"
and other possible variables such as interest rates [8]. Until recently, interest rates varied
over rather narrow ranges and when used in a regression model were often found to be
statistically insignificant as indicated by the usual $t$-test. Because of the approximate
linear dependence of the interest rate variable and the intercept, the problem of whether
to include an interest rate variable in these equations has been a source of much
controversy [8]. In this regard, the large variations in interest rates during the 1970s, 1980s
and 1990s may provide useful new data to economists even if they were not beneficial to
consumers.

In many cases new data cannot be collected and the analyst is stuck with what he 
has. In addition, if the variables are approximately collinear, by definition, it may not 
be possible to modify X in any simple way. 


(ii) Model respecification 


Changing the variables may help. For instance, in polynomial regression, artificial
collinearities may be introduced by using uncentered variables; centering the variables may
readily solve the problem.

Eliminating one or more of the variables present in any near dependency is a classical
solution, and variable selection techniques as discussed in Chapter 8 are often employed
in this regard. However, such procedures may seriously bias the resultant model and
result in poor predictive models, particularly if the multicollinearity was caused by the
choice of variables in the available data and is not due to any time dependency of the
underlying variables. (Note that in reality such techniques are biased estimation
procedures, but they are often not discussed in that context.)

In this regard the technique of using "auxiliary regressions" following the delineation
of the multicollinearities using the techniques of BKW [8] may be of value as an
alternative to standard variable selection techniques.

If none of these approaches is feasible, then attention usually shifts to improving the 
estimation using the data at hand. 


9.5 Biased Estimation 
9.5.1 Shrunken Estimators 


Since we have seen that one of the degrading effects of multicollinearity is to produce
"large" unstable estimates of $\beta$, it is reasonable to focus on biased estimation procedures
which try to shrink the size of the least squares estimator of $\beta$ and/or its variance.
As we pointed out previously, due to the Gauss-Markov theorem, biased estimation is
necessary in this regard if we wish to stay in the framework of linear estimation.

The idea is similar in philosophy to that used in variable selection (which we have
already indicated is a possible solution to the multicollinearity problem); that is, we hope
to trade a small bias in estimation for a large reduction in variance. This suggests
that we consider the class of estimators (B for biased)

$$\hat\beta_B = A\hat\beta \tag{9.30}$$

where $A$ is an unknown $m \times m$ matrix and $\hat\beta$ is the least squares estimator of $\beta$. $A$ then
has to be chosen to produce a desirable trade-off between bias and variance.

As we did in the variable selection problem, this is usually done by picking $A$ to
minimize the MSE of $\hat\beta_B$. Since choosing $A = I_m$ gives the least squares estimator, the
minimizing value of $A$ (if it exists) will have an MSE no larger than $\mathrm{Var}(\hat\beta)$. Thus, this
approach appears a reasonable way to solve the problem. On the other hand, we have
no guarantee that such constants exist. We begin by assuming that $A$ is a scalar so that
$\hat\beta_B = c\hat\beta$.

In this case

$$E(\hat\beta_B) = cE(\hat\beta) = c\beta \tag{9.31}$$

and, for each component (with $\delta_j$ the $j$-th diagonal element of $(X^T X)^{-1}$),

$$\mathrm{Var}(\hat\beta_{B,j}) = c^2\,\mathrm{Var}(\hat\beta_j) = c^2\sigma^2\delta_j. \tag{9.32}$$

Thus the bias vector is $\beta - c\beta = (1 - c)\beta$ and

$$\mathrm{MSE}(\hat\beta_B) = \mathrm{Var}(\hat\beta_B) + (\mathrm{Bias})^2 \tag{9.33a}$$

$$\phantom{\mathrm{MSE}(\hat\beta_B)} = c^2\sigma^2\sum_{i=1}^{m}\delta_i + (1 - c)^2\sum_{i=1}^{m}\beta_i^2. \tag{9.33b}$$

Minimizing $\mathrm{MSE}(\hat\beta_B)$ with respect to $c$ gives the optimum value of $c$ as

$$c_{opt} = \frac{\sum_{i=1}^{m}\beta_i^2}{\sum_{i=1}^{m}\beta_i^2 + \sigma^2\sum_{i=1}^{m}\delta_i}. \tag{9.34}$$

From this we see that $0 < c_{opt} < 1$ so that $\hat\beta$ is shrunk towards zero, and an easy
calculation shows that $\mathrm{MSE}(\hat\beta_B) < \mathrm{Var}(\hat\beta)$. Hence, $c_{opt}$ gives an estimator which
appears to counteract some of the deleterious effects of multicollinearity. However, in
examining $c_{opt}$ some problems are evident. First, using $c_{opt}\hat\beta$ shrinks all coefficients
equally and a priori there is no reason why this should be done. Similarly, if we want an
estimator that can correct incorrect signs then $c_{opt}$ is unable to do this.
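
The trade-off behind (9.33b) and (9.34) can be made concrete with a small calculation. The sketch below assumes hypothetical values for $\beta$, $\sigma^2$ and the diagonal elements $\delta_i$ of $(X^T X)^{-1}$, computes $c_{opt}$, and compares the resulting MSE with that of least squares ($c = 1$). It is an oracle calculation: it uses the true $\beta$ and $\sigma^2$, which are unknown in practice.

```python
def scalar_shrinkage(beta, sigma2, delta):
    """Return (c_opt, MSE at c_opt, MSE at c = 1) for the estimator c * beta_hat.

    MSE(c) = c^2 * sigma^2 * sum(delta) + (1 - c)^2 * sum(beta^2), as in (9.33b).
    """
    signal = sum(b * b for b in beta)
    total_variance = sigma2 * sum(delta)

    def mse(c):
        return c * c * total_variance + (1.0 - c) ** 2 * signal

    c_opt = signal / (signal + total_variance)  # equation (9.34)
    return c_opt, mse(c_opt), mse(1.0)


# Hypothetical values: large delta_i mimic a badly collinear design.
c_opt, mse_shrunk, mse_ols = scalar_shrinkage([1.0, 1.0], 1.0, [50.0, 50.0])
print(f"c_opt = {c_opt:.4f}, MSE(shrunk) = {mse_shrunk:.4f}, MSE(OLS) = {mse_ols:.1f}")
```

With strong collinearity the optimal factor is close to zero: the estimator gives up nearly all of the signal in exchange for removing nearly all of the variance, which illustrates why uniform shrinkage is a blunt instrument.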

A further difficulty (and this is characteristic of virtually all currently used biased
estimators) is that $c_{opt}$ depends on $\beta$ and $\sigma^2$, which are unknown, so that strictly
speaking $c_{opt}\hat\beta$ is not an estimator. If we now estimate $c_{opt}$ by

$$\hat c_{opt} = \frac{\sum_{i=1}^{m}\hat\beta_i^2}{\sum_{i=1}^{m}\hat\beta_i^2 + s^2\sum_{i=1}^{m}\delta_i} \tag{9.35}$$

and use $\hat c_{opt}\hat\beta$ to estimate $\beta$, then $\hat c_{opt}\hat\beta$ is an estimator but now we have no guarantee
that this estimator has smaller MSE than $\hat\beta$. (In the case $X^T X = I_m$, there are estimators
of the form $\hat c\hat\beta$, where $\hat c$ depends on $\hat\beta$ and $s^2$, which are known to have smaller MSE
than $\hat\beta$. This is the famous James-Stein estimator [67]. Further discussion can be found
in [32, 33, 118].)

To remedy the problem of equal shrinkage we could use $m$ different shrinking factors
$c_i$, $1 \le i \le m$, and the estimator $(c_1\hat\beta_1, c_2\hat\beta_2, \ldots, c_m\hat\beta_m)^T$. The $c_i$'s can be chosen to
minimize the mean square error given by

$$c_i^2\sigma^2\delta_i + (1 - c_i)^2\beta_i^2, \qquad 1 \le i \le m. \tag{9.36}$$

Doing this gives

$$c_i^{opt} = \frac{\beta_i^2}{\beta_i^2 + \sigma^2\delta_i}, \qquad 1 \le i \le m. \tag{9.37}$$

If we replace $c_i^{opt}$ with

$$\hat c_i = \frac{\hat\beta_i^2}{\hat\beta_i^2 + s^2\delta_i} \tag{9.38}$$

again we arrive at a true shrunken estimator of $\beta$ which again is not guaranteed to
minimize the MSE.
Since these estimators seem not to be used in practice, we will discuss them no further.
Rather, the reader can view this section as a prelude to the basic ideas of ridge estimators
which shrink $\hat\beta$ in a much more complicated way.


9.5.2 Ridge Regression 


Although shrunken estimators appear reasonable as a way of combating at least some
consequences of multicollinearity, they seem not to be used in practice, perhaps because
they take no direct account of what happens to the fit in terms of attempting to minimize
SSE. Thus, another point of view towards shrinking $\hat\beta$ is to choose $\hat\beta_R$ so that it
minimizes SSE subject to the constraint that $\langle\hat\beta_R, \hat\beta_R\rangle < \langle\hat\beta, \hat\beta\rangle$. In this case, we try
to find $\hat\beta_R$ as that which solves the constrained minimization problem:

$$\begin{aligned} &\text{Minimize } SSE = \langle y - X\beta,\; y - X\beta\rangle \\ &\text{subject to: } \langle\hat\beta_R, \hat\beta_R\rangle = d^2 < \langle\hat\beta, \hat\beta\rangle \quad (R \text{ for ridge}) \end{aligned} \tag{9.39}$$

where $d > 0$ is a prior restriction on the length of the parameter vector $\beta$. On the
other hand, it can be shown that shrunken estimators of the form $c\hat\beta$ are solutions to the
minimization problem

$$\begin{aligned} &\text{Minimize } \langle V(y - X\beta),\; V(y - X\beta)\rangle \\ &\text{subject to: } \langle\beta, \beta\rangle = d^2 \end{aligned} \tag{9.40}$$

where $V = (X^T X)^{-1}X^T$. Since the criterion function $\langle V(y - X\beta), V(y - X\beta)\rangle$ is not
the SSE, the solution to (9.39) seems to be more appropriate.
As we show next, the solution to (9.39) leads to the so-called ridge estimators, which
have received much attention as possible "cures" for the multicollinearity problem [?].
We begin by showing that the solution to the minimization problem (9.39) satisfies the
ridge estimation equation

$$(X^T X + \lambda I_m)\hat\beta_R = X^T y \tag{9.41}$$

where $\lambda$ is chosen to satisfy the constraint

$$\langle\hat\beta_R, \hat\beta_R\rangle = d^2. \tag{9.42}$$

To prove this, we again use the technique of Lagrange multipliers [104].
Thus, let

$$L(\beta, \lambda) = \langle y - X\beta,\; y - X\beta\rangle + \lambda\left(\langle\beta, \beta\rangle - d^2\right) \tag{9.43}$$

and then the minimizing values of $(\beta, \lambda)$ are obtained by solving

$$\partial L/\partial\beta = 0, \qquad \partial L/\partial\lambda = 0. \tag{9.44}$$

Doing these differentiations,

$$\frac{\partial L}{\partial\beta} = 2X^T X\beta - 2X^T y + 2\lambda\beta = 0, \tag{9.45}$$

$$\frac{\partial L}{\partial\lambda} = \langle\beta, \beta\rangle - d^2 = 0 \tag{9.46}$$

and calling a solution to (9.43) $\hat\beta_R$, we see that (9.44)-(9.46) yield (9.41) and (9.42). We
will now show that if $\lambda > 0$ and $X$ has full rank then (9.44)-(9.46) have a unique solution.
This solution is called the ridge estimate of $\beta$ and $\lambda$ is called the ridge parameter. (Of
course $\lambda$ is a function of $d$.)

First, observe that for any $\lambda \ge 0$, $X^T X + \lambda I_m$ has an inverse (this was shown in
Chapter 5 for $\lambda = 0$; the proof for $\lambda > 0$ is similar, since it is easily shown that $X^T X + \lambda I_m$
is positive definite). Thus, (9.41) has a unique solution

$$\hat\beta_R = (X^T X + \lambda I_m)^{-1} X^T y. \tag{9.47}$$


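We now need to show that $\lambda$ can be chosen to satisfy (9.42) for any $d > 0$ with
$d^2 < \langle\hat\beta, \hat\beta\rangle$. For this we will need an alternative expression for $\langle\hat\beta_R, \hat\beta_R\rangle$, which
follows from Theorem 9.1. That is,

$$\langle\hat\beta_R, \hat\beta_R\rangle = \sum_{i=1}^{m} \frac{\lambda_i^2\,\hat\gamma_i^2}{(\lambda_i + \lambda)^2} \tag{9.48}$$

where $\lambda_i$, $1 \le i \le m$, are the eigenvalues of $X^T X$ and $\hat\gamma_i = (Q^T\hat\beta)_i$, $1 \le i \le m$, where
$Q$ is the matrix of normalized eigenvectors of $X^T X$.

Now if $\lambda = 0$, $\langle\hat\beta_R, \hat\beta_R\rangle = \sum_{i=1}^{m}\hat\gamma_i^2 = \langle\hat\beta, \hat\beta\rangle$, while (9.48) shows that $\langle\hat\beta_R, \hat\beta_R\rangle$
is a decreasing function of $\lambda$ for $\lambda > 0$. Additionally, as $\lambda \to \infty$, $\langle\hat\beta_R, \hat\beta_R\rangle \to 0$, so that
if $d^2 < \langle\hat\beta, \hat\beta\rangle$ there is a unique value of $\lambda > 0$ satisfying $\langle\hat\beta_R, \hat\beta_R\rangle = d^2$.

If $d^2$ were known in advance then we could determine $\lambda$ by solving the nonlinear
equation

$$\sum_{i=1}^{m} \frac{\lambda_i^2\,\hat\gamma_i^2}{(\lambda_i + \lambda)^2} = d^2 \tag{9.49}$$

and the ridge estimator would then be given by

$$\hat\beta_R = (X^T X + \lambda I_m)^{-1} X^T y. \tag{9.50}$$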
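
Because the squared length of the ridge estimator is strictly decreasing in the ridge parameter, the nonlinear equation (9.49) can be solved by simple bisection. The sketch below assumes the squared length has the form $\sum_i \lambda_i^2\hat\gamma_i^2/(\lambda_i + \lambda)^2$ with $\hat\gamma = Q^T\hat\beta$, which follows from writing the ridge estimator in the eigenbasis of $X^T X$. The eigenvalues, the vector `gamma_hat` and the target `d2` are hypothetical inputs.

```python
def ridge_squared_length(lam, eigenvalues, gamma_hat):
    """Squared length of the ridge estimator expressed in the eigenbasis of X^T X."""
    return sum((e * g / (e + lam)) ** 2 for e, g in zip(eigenvalues, gamma_hat))


def solve_ridge_parameter(d2, eigenvalues, gamma_hat, iterations=200):
    """Find lam > 0 such that the ridge estimator has squared length d2 (bisection)."""
    ols_length = ridge_squared_length(0.0, eigenvalues, gamma_hat)
    if not 0.0 < d2 < ols_length:
        raise ValueError("d2 must lie strictly between 0 and the OLS squared length")
    lo, hi = 0.0, 1.0
    # Expand the bracket until the length at hi drops below the target.
    while ridge_squared_length(hi, eigenvalues, gamma_hat) > d2:
        hi *= 2.0
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if ridge_squared_length(mid, eigenvalues, gamma_hat) > d2:
            lo = mid  # still too long: need more shrinkage
        else:
            hi = mid
    return 0.5 * (lo + hi)


eigenvalues = [1.99, 0.01]   # hypothetical eigenvalues of X^T X
gamma_hat = [1.0, 3.0]       # hypothetical Q^T beta_hat, OLS squared length 10
d2 = 5.0                     # hypothetical prior bound on the squared length
lam = solve_ridge_parameter(d2, eigenvalues, gamma_hat)
print(f"lambda = {lam:.6f}, achieved squared length = "
      f"{ridge_squared_length(lam, eigenvalues, gamma_hat):.6f}")
```

Any monotone root finder would do here; bisection is used only because it needs no derivative and cannot leave the bracket.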


Since, by definition, $d^2 < \langle\hat\beta, \hat\beta\rangle$, the ridge estimator shrinks $\hat\beta$ towards the origin, as
do the estimators in the previous section. In this case, the shrinkage is accomplished (see
Theorem 9.1) by multiplying the least squares estimator by the matrix

$$A(\lambda) = (X^T X + \lambda I_m)^{-1} X^T X, \tag{9.51}$$

so it is an estimator of the form in (9.30).
Of course in practice $d^2$ will not be known and it will generally have to be chosen
from the data to provide an estimator which one hopes is somehow "better" than $\hat\beta$.
However, as is emphasized by Draper and Van Nostrand [28] and Draper and Smith [27],
no matter how $\lambda$ is chosen, the ridge estimate can always be viewed as a least squares
estimator of $\beta$ where prior information is used to constrain the length of the resulting
estimator. (This interpretation of $\hat\beta_R$ has been used as a basis for criticism of many of
the published simulation studies purporting to show that ridge regression is better than
OLS [28].)

To get some intuitive feel for what ridge estimation does, recall that multicollinearity
is generally indicated by a large condition number of $X^T X$. Since the eigenvalues of
$X^T X + \lambda I_m$ are $\lambda_i + \lambda$, $1 \le i \le m$, where $\lambda_i$, $1 \le i \le m$, are the eigenvalues of $X^T X$,
choosing small values of $\lambda$ can substantially improve the conditioning of $X^T X + \lambda I_m$ over
that of $X^T X$.

Example 9.3 Suppose that $\lambda_{\max} = 1$ and $\lambda_{\min} = 10^{-4}$. Then $\kappa(X^T X) = 10^4$.
Choosing $\lambda = 10^{-2}$ gives

$$\kappa(X^T X + \lambda I) = \frac{1 + 10^{-2}}{10^{-4} + 10^{-2}} = \frac{1.01}{0.0101} = 100,$$

a reduction by a factor of 100. Similar reductions can be found for the VIFs.
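
The arithmetic of Example 9.3 is easy to reproduce. The short sketch below uses the fact stated above, that adding $\lambda$ to the diagonal shifts every eigenvalue by $\lambda$; the function name is a hypothetical helper.

```python
def ridge_condition_number(lam_max, lam_min, lam):
    """Condition number of X^T X + lam * I, given the extreme eigenvalues of X^T X."""
    return (lam_max + lam) / (lam_min + lam)


before = ridge_condition_number(1.0, 1e-4, 0.0)   # plain X^T X
after = ridge_condition_number(1.0, 1e-4, 1e-2)   # ridge parameter 10^-2

print(f"condition number without ridge: {before:.1f}")
print(f"condition number with ridge:    {after:.1f}")
print(f"reduction factor:               {before / after:.1f}")
```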


It is this property of reduction in conditioning which was a major factor in introducing 
the technique of adding a small constant to the diagonal elements of an ill-conditioned 
matrix. It is interesting to note that this idea is widely used in many other areas of 
applied mathematics [41] where it is usually called regularization. 

How to choose $\lambda$ in the absence of realistic prior information on the length of $\hat\beta_R$ is
the most difficult part of ridge estimation and has led to a large number of techniques
[27, 28, 87], with different methods often providing conflicting results [39]. The difficulty,
as with the shrunken estimators $c\hat\beta$, is that if $\hat\beta_R$ is chosen to have certain optimality
properties, such as minimizing the MSE, then the ridge parameter $\lambda$ will depend
on the unknown values of $\beta$. When these values are replaced by estimates, $\lambda$ becomes a
random variable $\hat\lambda$ with a generally unknown distribution, whose use in (9.50) gives rise
to an estimator which is no longer known to have any optimality properties. Researchers
have attempted to resolve some of these problems via Monte Carlo simulations.

However, as is shown in Gibbons (1981) [39], the resulting properties of the different
ridge estimators are again heavily dependent on $X$, $\beta$ and $\sigma^2$, and the estimators do not always
improve on OLS. This property (at least in the author's opinion) only serves to
accentuate the criticism of Draper and Van Nostrand that ridge estimation always implies
prior information concerning $\beta$. If such information is available and reliable, then the
statistician may proceed in the spirit of Bayesian analysis to use this information in a
rational way to (perhaps) improve on OLS. In the absence of such information, the use
of ridge estimation (or any similar biased estimation method) will not be guaranteed to
improve on OLS, and may in fact be worse.

Notwithstanding these caveats, we now turn to a presentation of some of the more
popular methods for selecting $\lambda$. In this regard we have been guided by the simulation
results of Gibbons [39] and the survey of Draper and Van Nostrand [28].

To motivate many of these choices we will need to know a number of properties of
the ridge estimator $\hat\beta_R$ for a fixed value of $\lambda$. These are summarized in Theorem 9.1.


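Theorem 9.1 (Properties of $\hat\beta_R$ when $\lambda$ is known) Suppose that $\hat\beta_R$ is the unique
solution of (9.39) with $\lambda > 0$ known. Then,

(i) $\hat\beta_R = A(\lambda)\hat\beta$,
where $A(\lambda) = (X^T X + \lambda I_m)^{-1} X^T X$ and $\hat\beta$ is the least squares estimator of $\beta$.

(ii) $\hat\beta_R$ is generally a biased estimator of $\beta$ if $\lambda > 0$, with the bias given by

$$E(\hat\beta_R) - \beta = -\lambda(X^T X + \lambda I_m)^{-1}\beta. \tag{9.52}$$

(iii) The variance-covariance matrix $\Sigma(\hat\beta_R)$ of $\hat\beta_R$ is given by

$$\Sigma(\hat\beta_R) = \sigma^2 (X^T X + \lambda I_m)^{-1} X^T X (X^T X + \lambda I_m)^{-1}. \tag{9.53}$$

(iv) The mean square error of $\hat\beta_R$ is given by

$$\mathrm{MSE}(\hat\beta_R) = \sigma^2\sum_{j=1}^{m}\frac{\lambda_j}{(\lambda_j + \lambda)^2} + \lambda^2\sum_{j=1}^{m}\frac{\gamma_j^2}{(\lambda_j + \lambda)^2} \tag{9.54}$$

where $\lambda_j$, $1 \le j \le m$, are the eigenvalues of $X^T X$, $\gamma_j = (Q^T\beta)_j$, $1 \le j \le m$, and
$Q$ is the matrix of orthonormal eigenvectors of $X^T X$.

(v) (Hoerl and Kennard [58]). There exists a value of $\lambda > 0$ such that

$$\mathrm{MSE}(\hat\beta_R) < \mathrm{MSE}(\hat\beta). \tag{9.55}$$

(vi) If $\lambda > 0$,

$$\langle\hat\beta_R, \hat\beta_R\rangle < \langle\hat\beta, \hat\beta\rangle. \tag{9.56}$$

Proof. (i) From (9.41)

$$\hat\beta_R = (X^T X + \lambda I_m)^{-1} X^T y = (X^T X + \lambda I_m)^{-1}(X^T X)(X^T X)^{-1} X^T y = (X^T X + \lambda I_m)^{-1} X^T X\hat\beta = A(\lambda)\hat\beta. \tag{9.57}$$

(ii) Since $\hat\beta$ is unbiased,

$$E(\hat\beta_R) = A(\lambda)E(\hat\beta) = A(\lambda)\beta \neq \beta \tag{9.58}$$

unless $A(\lambda)\beta = \beta$ for all $\beta \in R^m$, that is, $A(\lambda) = I_m$, and this requires $\lambda = 0$. The bias of $\hat\beta_R$ is then given
by

$$A(\lambda)\beta - \beta = \left[(X^T X + \lambda I_m)^{-1} X^T X - I_m\right]\beta = (X^T X + \lambda I_m)^{-1}\left[-\lambda I_m\right]\beta = -\lambda(X^T X + \lambda I_m)^{-1}\beta. \tag{9.59}$$

(iii) Since $\hat\beta_R = A(\lambda)\hat\beta$,

$$\begin{aligned} \Sigma(\hat\beta_R) &= A(\lambda)\Sigma(\hat\beta)A^T(\lambda) = \sigma^2 A(\lambda)(X^T X)^{-1}A^T(\lambda) \\ &= \sigma^2 (X^T X + \lambda I_m)^{-1} X^T X (X^T X)^{-1} X^T X (X^T X + \lambda I_m)^{-1} \\ &= \sigma^2 (X^T X + \lambda I_m)^{-1} X^T X (X^T X + \lambda I_m)^{-1}. \end{aligned} \tag{9.60}$$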
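(iv) From Equation (9.33a) we have

$$\mathrm{MSE}(\hat\beta_R) = \mathrm{Var}(\hat\beta_R) + (\mathrm{Bias})^2 \tag{9.61}$$

and using (9.59) and (9.60)

$$(\mathrm{Bias})^2 = \lambda^2\left\langle (X^T X + \lambda I_m)^{-1}\beta,\; (X^T X + \lambda I_m)^{-1}\beta\right\rangle \tag{9.62}$$

and

$$\mathrm{Var}(\hat\beta_R) = \sigma^2\,\mathrm{tr}\left[(X^T X + \lambda I_m)^{-1} X^T X (X^T X + \lambda I_m)^{-1}\right] = \sigma^2\,\mathrm{tr}\left[(X^T X + \lambda I_m)^{-2} X^T X\right]. \tag{9.63}$$

Now, $X^T X = Q\Lambda Q^T$ so that $X^T X + \lambda I_m = Q\Lambda Q^T + \lambda I_m = Q(\Lambda + \lambda I_m)Q^T$,
which gives

$$(X^T X + \lambda I_m)^{-2} = Q(\Lambda + \lambda I_m)^{-2}Q^T \tag{9.64}$$

and

$$(X^T X + \lambda I_m)^{-2} X^T X = Q(\Lambda + \lambda I_m)^{-2}\Lambda Q^T. \tag{9.65}$$

Thus,

$$\begin{aligned} \mathrm{tr}\left[(X^T X + \lambda I_m)^{-2} X^T X\right] &= \mathrm{tr}\left[Q(\Lambda + \lambda I_m)^{-2}\Lambda Q^T\right] = \mathrm{tr}\left[Q^T Q(\Lambda + \lambda I_m)^{-2}\Lambda\right] \\ &= \mathrm{tr}\left[(\Lambda + \lambda I_m)^{-2}\Lambda\right] \quad (\text{since } Q^T Q = I_m) \\ &= \sum_{j=1}^{m}\frac{\lambda_j}{(\lambda_j + \lambda)^2}. \end{aligned} \tag{9.66}$$

Using this in (9.63) gives the first term in (9.61).
Again, using the spectral decomposition of $X^T X$ and the orthogonality of $Q$,

$$\begin{aligned} (\mathrm{Bias})^2 &= \lambda^2\left\langle Q(\Lambda + \lambda I_m)^{-1}Q^T\beta,\; Q(\Lambda + \lambda I_m)^{-1}Q^T\beta\right\rangle \\ &= \lambda^2\left\langle (\Lambda + \lambda I_m)^{-1}Q^T\beta,\; (\Lambda + \lambda I_m)^{-1}Q^T\beta\right\rangle \\ &= \lambda^2\sum_{j=1}^{m}\frac{\gamma_j^2}{(\lambda_j + \lambda)^2} \end{aligned} \tag{9.67}$$

where $\gamma_j = (Q^T\beta)_j$. Finally, using (9.66) and (9.67) in the expression (9.61) for
$\mathrm{MSE}(\hat\beta_R)$ we get

$$\mathrm{MSE}(\hat\beta_R) = \sum_{j=1}^{m}\frac{\lambda^2\gamma_j^2 + \lambda_j\sigma^2}{(\lambda_j + \lambda)^2}. \tag{9.68}$$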
(v) From (9.68) we see that if $\lambda = 0$, then $\mathrm{MSE}(\hat\beta_R) = \mathrm{MSE}(\hat\beta)$, so that in order
to prove (v) it suffices to prove that $\mathrm{MSE}(\hat\beta_R)$ is a decreasing function near $\lambda = 0$, and
for this all we need to show is that $\partial\,\mathrm{MSE}(\hat\beta_R)/\partial\lambda\,|_{\lambda=0} < 0$. Taking this derivative we
find that

$$\left.\frac{\partial\,\mathrm{MSE}(\hat\beta_R)}{\partial\lambda}\right|_{\lambda=0} = -2\sigma^2\sum_{j=1}^{m}\frac{1}{\lambda_j^2} < 0. \tag{9.69}$$

(vi) The calculations for (9.56) are similar to those in proving (v) and are left as an
exercise. ∎
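
Parts (iv) and (v) of Theorem 9.1 can be illustrated numerically. The sketch below evaluates the MSE formula (9.54) for hypothetical values of $\sigma^2$, the eigenvalues $\lambda_j$ and $\gamma = Q^T\beta$, and checks that the MSE drops as the ridge parameter moves away from zero. As in the theorem, this is an oracle calculation that uses the unknown $\beta$ and $\sigma^2$.

```python
def ridge_mse(lam, sigma2, eigenvalues, gamma):
    """MSE of the ridge estimator from (9.54): variance term plus squared bias term."""
    variance = sigma2 * sum(e / (e + lam) ** 2 for e in eigenvalues)
    bias_squared = lam ** 2 * sum(g * g / (e + lam) ** 2
                                  for e, g in zip(eigenvalues, gamma))
    return variance + bias_squared


sigma2 = 1.0
eigenvalues = [1.99, 0.01]   # hypothetical eigenvalues of X^T X
gamma = [1.0, 1.0]           # hypothetical Q^T beta

mse_ols = ridge_mse(0.0, sigma2, eigenvalues, gamma)
mse_ridge = ridge_mse(0.01, sigma2, eigenvalues, gamma)

# Forward-difference slope at zero; (9.69) says it should be negative.
h = 1e-6
slope_at_zero = (ridge_mse(h, sigma2, eigenvalues, gamma) - mse_ols) / h

print(f"MSE at lambda = 0:    {mse_ols:.4f}")
print(f"MSE at lambda = 0.01: {mse_ridge:.4f}")
print(f"approximate slope at lambda = 0: {slope_at_zero:.1f}")
```

The steep negative slope at zero is driven by the $1/\lambda_j^2$ term of the smallest eigenvalue, which is why even a tiny ridge parameter can reduce the MSE substantially when the design is badly conditioned.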


9.5.3 Choosing the Ridge Parameter


Each of the various properties of the ridge estimator that we have discussed so far has 
motivated one or more methods for estimating A. Generally, these attempt to do one or 
more of the following: 


(i) Reduce the variance and/or stabilize the coefficient estimates relative to OLS. 


(ii) Produce an estimator with a smaller MSE or prediction mean squared error than
OLS.


As part (v) of Theorem 9.1 shows, there exists a value of the ridge parameter $\lambda$ which
gives an estimator $\hat\beta_R$ with smaller MSE than that of $\hat\beta$. Ideally, the best estimator from this
point of view would be the one that minimizes the MSE. However, as (9.68) shows, this
minimizing value would generally depend on $\beta$ and $\sigma^2$, which are unknown, and so such an
estimator cannot be implemented in practice. One way out would be to work iteratively
by first estimating $\beta$ and $\sigma^2$ in (9.54), say by OLS, finding the value of $\lambda$ that minimizes
the resulting estimate of $\mathrm{MSE}(\hat\beta_R)$, and repeating until these estimates appear to have
converged. However, even if this were done, we would still have no guarantee (and we
know of no proof) that the resulting estimator would have smaller MSE than the OLS
estimate $\hat\beta$.

Typically, then, each procedure chosen yields a different value of $\lambda$, with these
values differing substantially for any given problem depending on the method used. As a
consequence, at the current stage of development of the subject it is not possible to say
which procedure is "best" or even whether any generally provide estimators which are better
than OLS. However, since such methods are in widespread use, we will discuss a number
of those which seem to have shown the most promise. Here we have to keep in
mind that the relative merits of the various techniques have for the most part been studied
by simulation, since very little is known theoretically. At the same time, remember that
some authors have found methodological faults with these studies as well.

The methods for choosing $\lambda$ fall into two broad categories: stochastic and
nonstochastic. Stochastic methods use the observed values $y$ in some way so that the resulting
$\lambda$'s are actually random variables. Nonstochastic methods make use only of the design
matrix $X$. We begin with a discussion of stochastic methods, since these have proved to
be the most popular in practice.

Stochastic Methods for Choosing $\lambda$


(1) The Ridge Trace This method, introduced by Hoerl and Kennard [58, 59], is
perhaps the oldest and apparently the most recommended method for choosing $\lambda$, although
virtually no formal evidence can be found in its favor. Here the ridge estimator $\hat\beta_R(\lambda)$
is calculated for a sequence of values of $\lambda$, typically $0 \le \lambda \le 1$, and the individual
coefficients of $\hat\beta_R(\lambda)$ are plotted against $\lambda$. As observed previously,
$\langle\hat\beta_R(\lambda), \hat\beta_R(\lambda)\rangle < \langle\hat\beta, \hat\beta\rangle$ for $\lambda > 0$, and $\hat\beta_R(\lambda) \to 0$ as $\lambda \to \infty$, so we expect that these
plots will show a rapid decline in the magnitude of these coefficients as $\lambda$ increases, and that
the coefficients may tend to stabilize at some value of $\lambda$, which is usually determined by
visual inspection. Typical behavior is shown in Figure 9.1.


Figure 9.1: Graph of ridge trace


This method is highly subjective, since it is not exactly clear what value of $\lambda$ represents
stability and different plots stabilize at different points. (Perhaps plotting
$\langle\hat\beta_R(\lambda), \hat\beta_R(\lambda)\rangle$ would be a better choice in this regard but it seems not to be done.)
In addition, some authors have argued that the "exact" value of $\lambda$ is not that important.
But simulation studies have shown that even small changes in $\lambda$ can produce rather large
changes in MSE, so it is not clear that one can be so cavalier.


Example 9.4 We consider the Hald data to illustrate the ridge trace procedure for
different biasing constants $\lambda$. From the original model ($m = 4$, $n = 13$), let us consider
the regression in correlation form. That is, the new centered and scaled regressors $z_{ij}$ are

$$z_{ij} = \frac{x_{ij} - \bar x_j}{\sqrt{S_{jj}}}, \qquad j = 1, 2, \ldots, m, \quad i = 1, 2, \ldots, n \tag{9.70}$$

where $S_{jj} = \sum_{i=1}^{n}(x_{ij} - \bar x_j)^2$, and take $y$ to be only centered. Then, the $m \times m$ matrix
$Z^T Z$ is the correlation matrix of the $x$'s, namely

$$Z^T Z = \begin{pmatrix} 1.00000 & 0.22858 & -0.82414 & -0.24545 \\ 0.22858 & 1.00000 & -0.13924 & -0.97296 \\ -0.82414 & -0.13924 & 1.00000 & 0.02954 \\ -0.24545 & -0.97296 & 0.02954 & 1.00000 \end{pmatrix},$$

and the mean and the standard deviation of each predictor variable are

$\bar x_1 = 7.4615$, $\bar x_2 = 48.154$, $\bar x_3 = 11.769$, $\bar x_4 = 30.000$,
$s_1 = 5.8824$, $s_2 = 15.561$, $s_3 = 6.4051$, $s_4 = 16.738$.

Also, the mean of $y$, $\bar y$, equals 95.423.
To obtain the ridge solution for the Hald data, we need to solve the equation

$$(Z^T Z + \lambda I)\hat\beta_R = Z^T y \tag{9.71}$$

for various values of $\lambda$, $0 \le \lambda \le 1$. The ridge standardized coefficients for selected values
of $\lambda$ are presented in Table 9.1.


Table 9.1 Standardized Ridge Coefficients for $\lambda$

  λ        β̂_1(λ)      β̂_2(λ)      β̂_3(λ)      β̂_4(λ)
  0.000    31.6064     27.4984      2.2602     -8.3552
  0.002    28.7892     20.3721     -0.8305    -15.8571
  0.004    27.8876     18.3384     -1.7883    -17.9885
  0.006    27.3958     17.3886     -2.2900    -18.9766
  0.010    26.8001     16.5006     -2.8629    -19.8849
  0.020    25.9251     15.8039     -3.6179    -20.5432
  0.050    24.2839     15.5514     -4.8362    -20.5395
  0.100    22.4072     15.6238     -6.0297    -19.9286
  0.500    16.0618     14.6060     -8.0746    -16.2726
  1.000    12.8080     12.6889     -7.5705    -13.5439

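In (9.71) when $\lambda = 0$, we obtain the standardized least squares estimates of $\beta$. In order
to convert them to the original estimates $\hat\beta^T = (\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_m)$ we use the following
conversion formulas

$$\hat\beta_j = \frac{\hat\beta_{R,j}}{\sqrt{S_{jj}}}, \qquad j = 1, 2, \ldots, m \tag{9.72}$$

$$\hat\beta_0 = \bar y - \sum_{j=1}^{m}\hat\beta_j\bar x_j \tag{9.73}$$

where $\hat\beta_{R,j}$ is the $j$-th standardized coefficient and $S_{jj} = \sum_{i=1}^{n}(x_{ij} - \bar x_j)^2$.

Using (9.72) we obtained the estimated ridge coefficients which are shown in Table 9.2.
As we noted, the ridge coefficients for $\lambda = 0$ coincide with the regression coefficients in
the standard fit. In order to examine the behavior of the ridge coefficients, a graph of
the ridge trace is shown in Figure 9.2.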


Table 9.2 Estimated Ridge Coefficients for $\lambda$

  $\lambda$    $\hat{\beta}_1(\lambda)$   $\hat{\beta}_2(\lambda)$   $\hat{\beta}_3(\lambda)$   $\hat{\beta}_4(\lambda)$
  0.000     1.55106     0.51013     0.10187    -0.14410
  0.002     1.41281     0.37793    -0.03743    -0.27348
  0.004     1.36857     0.34020    -0.08060    -0.31024
  0.006     1.34443     0.32258    -0.10320    -0.32728
  0.010     1.31520     0.30611    -0.12903    -0.34295
  0.020     1.27226     0.29318    -0.16306    -0.35430
  0.050     1.19172     0.28850    -0.21797    -0.35424
  0.100     1.09962     0.28984    -0.27176    -0.34370
  0.500     0.78822     0.27096    -0.36392    -0.28065
  1.000     0.62854     0.23539    -0.34120    -0.23359
Figure 9.2: Ridge trace for the Hald data using four regressors (horizontal axis: $\lambda$, from 0.0 to 1.0)
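The computation behind Tables 9.1 and 9.2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values are those of the commonly published Hald cement data set (assumed to match Example 9.4), the regressors are centered and scaled to unit length, $y$ is centered, and the function name `ridge_path` is hypothetical.

```python
import numpy as np

# Hald cement data (13 observations, 4 regressors); assumed to be the
# same data as in Example 9.4.
X = np.array([
    [7, 26, 6, 60], [1, 29, 15, 52], [11, 56, 8, 20], [11, 31, 8, 47],
    [7, 52, 6, 33], [11, 55, 9, 22], [3, 71, 17, 6], [1, 31, 22, 44],
    [2, 54, 18, 22], [21, 47, 4, 26], [1, 40, 23, 34], [11, 66, 9, 12],
    [10, 68, 8, 12],
], dtype=float)
y = np.array([78.5, 74.3, 104.3, 87.6, 95.9, 109.2, 102.7,
              72.5, 93.1, 115.9, 83.8, 113.3, 109.4])


def ridge_path(X, y, lambdas):
    """Solve (Z'Z + lam*I) b = Z'y for each lam and convert back via (9.72)-(9.73)."""
    x_bar = X.mean(axis=0)
    root_S = np.sqrt(((X - x_bar) ** 2).sum(axis=0))  # sqrt(S_jj)
    Z = (X - x_bar) / root_S                          # unit-length scaling
    y_c = y - y.mean()                                # centered response
    m = X.shape[1]
    results = []
    for lam in lambdas:
        b_std = np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y_c)
        b = b_std / root_S                            # slopes on the original scale
        b0 = y.mean() - b @ x_bar                     # intercept
        results.append((lam, b_std, b0, b))
    return results


path = ridge_path(X, y, [0.0, 0.01, 0.02, 1.0])
for lam, b_std, b0, b in path:
    print(lam, np.round(b_std, 4), round(b0, 3), np.round(b, 5))
```

Plotting the standardized coefficients against $\lambda$ gives a ridge trace of the kind shown in Figure 9.2.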


As we see from the graph and Table 9.1, a desirable value of $\lambda$ can be chosen
between 0.01 and 0.02.

The ridge trace is a very sensible and pragmatic way of choosing the shrinkage
parameter $\lambda$. As $\lambda$ gets bigger, the variance is reduced and the coefficients become more
stable. A value of $\lambda$ is chosen at the point beyond which the coefficients are no longer
changing rapidly. However, it should be noted that stability does not imply that the regression
coefficients have converged.

In addition to this, we can think of a couple of other plausible criteria to look at.
One is the values of the VIFs (values near 1 are desirable), and another is the coefficient
of multiple determination $R^2$ for each value of $\lambda$.


(2) The Harmonic Mean Estimator  A simple, objective estimator for $\lambda$ was given
by Hoerl, Kennard and Baldwin in [62]. They proposed using

$$\hat{\lambda} = m s^2 / (\hat{\beta}, \hat{\beta}) \tag{9.74}$$

where $\hat{\beta}$ is the least squares estimator of $\beta$, $(\hat{\beta}, \hat{\beta}) = \hat{\beta}^T\hat{\beta}$, and $s^2$ is the usual estimate
of variance in the model. This choice of $\lambda$ can be motivated by considering the harmonic mean of
the optimal ridge parameters in generalized ridge regression (this derivation will be
discussed in Section 9.5). As measured by reduction in MSE, it was one of the three best in
the simulation study of ten different estimators by Galarneau-Gibbons [39], although a
previous study by Gunst and Mason [46] found that it had a large standard deviation and could be
quite variable. On the other hand, it is easy to calculate, has appealing statistical
interpretations and requires no subjective judgements on the part of the analyst.

A possible improvement of this estimator was suggested in Hoerl and Kennard [61].
There $\hat{\lambda}$ was defined as the assumed limit of the sequence

$$\hat{\lambda}_{i+1} = m s^2 / (\hat{\beta}_R(\hat{\lambda}_i), \hat{\beta}_R(\hat{\lambda}_i)), \quad i \ge 0, \tag{9.75}$$

where $\hat{\lambda}_0$ is given by (9.74). A complicated rule for terminating the iteration is given
in [61], while Galarneau-Gibbons terminated the iterations when $|\hat{\lambda}_{i+1} - \hat{\lambda}_i| < 10^{-4}$ and
defaulted to the least squares estimator if convergence was not obtained in 30 iterations.
This rule also performed well in her study, but not always better than the noniterative
version.
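The iteration (9.75) with the Galarneau-Gibbons stopping rule can be sketched as follows. This is a minimal illustration, not the authors' code: the synthetic data, the function names `ridge` and `iterated_hkb`, and the choice of $n - m - 1$ residual degrees of freedom (one lost to centering) are all assumptions.

```python
import numpy as np


def ridge(Z, y, lam):
    """Ridge solution (Z'Z + lam*I)^{-1} Z'y for standardized Z and centered y."""
    m = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(m), Z.T @ y)


def iterated_hkb(Z, y, tol=1e-4, max_iter=30):
    """Iterate (9.75) from the start value (9.74).

    Returns (lambda, converged, lambda_0). Following the rule quoted in the
    text, lambda = 0 (least squares) is returned if there is no convergence.
    """
    n, m = Z.shape
    beta_ls = ridge(Z, y, 0.0)
    resid = y - Z @ beta_ls
    s2 = resid @ resid / (n - m - 1)      # assumed degrees of freedom
    lam0 = m * s2 / (beta_ls @ beta_ls)   # (9.74)
    lam = lam0
    for _ in range(max_iter):
        beta_r = ridge(Z, y, lam)
        lam_new = m * s2 / (beta_r @ beta_r)
        if abs(lam_new - lam) < tol:
            return lam_new, True, lam0
        lam = lam_new
    return 0.0, False, lam0


# Synthetic, deliberately collinear example data (illustrative only).
rng = np.random.default_rng(0)
n = 40
base = rng.normal(size=n)
raw = np.column_stack([base + 0.05 * rng.normal(size=n),
                       base + 0.05 * rng.normal(size=n),
                       rng.normal(size=n)])
centered = raw - raw.mean(axis=0)
Z = centered / np.sqrt((centered ** 2).sum(axis=0))
y = Z @ np.array([3.0, 2.0, -1.0]) + 0.5 * rng.normal(size=n)
y = y - y.mean()

lam, converged, lam0 = iterated_hkb(Z, y)
print(lam0, lam, converged)
```

Because the length of the ridge estimator does not increase with $\lambda$, the iterates never fall below the starting value $\hat{\lambda}_0$.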


Example 9.5 From the previous Example 9.4, the number of regressors in the
model is $m = 4$, the standard error from the standard fit is $s = 2.446$ and the standardized
least squares estimates of $\beta$ from Table 9.1 are

$$\hat{\beta}(0) = (\hat{\beta}_1, \hat{\beta}_2, \hat{\beta}_3, \hat{\beta}_4) = (31.6064, 27.4984, 2.2602, -8.3552).$$

Using (9.74) a reasonable choice of $\lambda$ is
$$\hat{\lambda} = m s^2 / (\hat{\beta}^T \hat{\beta}) = 4(2.446)^2 / 1830.044 = 0.0130771.$$
Thus, when we choose $\lambda = 0.0131$,

$$\hat{\beta}_R^T = (26.4778, 16.1652, -3.1528, -20.2158),$$
and using (9.72)-(9.73), we have the resulting fitted ridge regression model
$$\hat{y}_R = 83.414 + 1.30 x_1 + 0.30 x_2 - 0.1421 x_3 - 0.3487 x_4.$$

We note that in this model the sign of $\hat{\beta}_3$ is now negative, $\hat{\beta}_4$ is bigger in magnitude, and both $\hat{\beta}_1$ and
$\hat{\beta}_2$ are smaller than the ones in the original model.
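The arithmetic of Example 9.5 is easy to check numerically. The snippet below is a small sketch that applies (9.74) to the standardized least squares estimates quoted in the example; the function name `hkb_lambda` is hypothetical.

```python
import numpy as np


def hkb_lambda(beta_ls, s2):
    """Harmonic mean estimator (9.74): m * s^2 / (beta' beta)."""
    beta_ls = np.asarray(beta_ls, dtype=float)
    return beta_ls.size * s2 / (beta_ls @ beta_ls)


# Values taken from Example 9.5 (row lambda = 0 of Table 9.1, s = 2.446).
beta_std = [31.6064, 27.4984, 2.2602, -8.3552]
s = 2.446

lam_hat = hkb_lambda(beta_std, s ** 2)
print(np.dot(beta_std, beta_std))  # the denominator, about 1830.044
print(lam_hat)                     # about 0.0131
```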


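(3) SRIDG  This estimator of $\lambda$ is based on the observation that the value of $\lambda$ which
minimizes MSE$(\hat{\beta}_R)$ is given as the solution to the equation (obtained by differentiating
the MSE in (9.63) with respect to $\lambda$)

$$\sum_{i=1}^{m} \frac{\lambda_i (\lambda \gamma_i^2 - \sigma^2)}{(\lambda_i + \lambda)^3} = 0. \tag{9.76}$$

Estimating $\sigma^2$ in (9.68) by $s^2$ from a preliminary least squares fit and $\gamma_i$ by
$$\hat{\gamma}_i(\lambda) = [Q^T \hat{\beta}_R(\lambda)]_i, \quad 1 \le i \le m, \tag{9.77}$$
an approximation $\hat{\lambda}$ to the value of $\lambda$ satisfying (9.76) is obtained by evaluating

$$|s(\lambda)| = \left| \sum_{i=1}^{m} \frac{\lambda_i (\lambda \hat{\gamma}_i^2(\lambda) - s^2)}{(\lambda_i + \lambda)^3} \right| \tag{9.78}$$

for a range of values of $\lambda$ and choosing $\hat{\lambda}$ as the value which minimizes $|s(\lambda)|$. This
estimator was also one of the three best in [39].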
(4) Bayes Estimators  A number of stochastic estimators have been based on a
Bayesian interpretation of the ridge estimator. In this context, we assume that each
regression coefficient has a prior distribution which is $N(0, \sigma_\beta^2)$ and then the Bayes
estimator $\hat{\beta}_B$ of $\beta$ is
$$\hat{\beta}_B = (X^T X + (\sigma^2/\sigma_\beta^2) I_m)^{-1} X^T y \tag{9.79}$$
which may be interpreted as a ridge estimator with ridge parameter $\lambda = \sigma^2/\sigma_\beta^2$.
Replacing $\sigma^2$ and $\sigma_\beta^2$ by various estimators gives a whole series of methods for estimating
$\lambda$.
For instance, if $\sigma_\beta^2 = \frac{1}{m} E((\beta, \beta))$ (remember now that $\beta$ is a random vector) is
estimated by
$$\hat{\sigma}_\beta^2 = \frac{1}{m} \sum_{i=1}^{m} \hat{\beta}_i^2 = \frac{1}{m} (\hat{\beta}, \hat{\beta}) \tag{9.80}$$
where the $\hat{\beta}_i$'s are the OLS estimators of $\beta_i$, and $s^2$ is the usual estimate of $\sigma^2$, then $\lambda$ may be
estimated by

$$\hat{\lambda} = \frac{s^2}{\hat{\sigma}_\beta^2} = \frac{m s^2}{(\hat{\beta}, \hat{\beta})} \tag{9.81}$$

which is just the harmonic mean estimator. A similar estimator, $\hat{\lambda} = (m + 2) s^2 / (\hat{\beta}, \hat{\beta})$,
and an iterative version were proposed by Lindley and Smith in [75]. Further examples of
estimators derived from a Bayesian point of view may be found in [84, 105].


(5) Constrained Estimators  As we have noted previously, the ridge estimator may
be viewed as a constrained least squares estimator with the goal of ridge estimation
to produce "shorter" estimates of $\beta$ than $\hat{\beta}$. Using this as motivation McDonald and
Galarneau [79] developed a method for estimating $\lambda$ based on (9.24) which shows that

$$E((\hat{\beta}, \hat{\beta})) = (\beta, \beta) + \sigma^2 \sum_{i=1}^{m} \frac{1}{\lambda_i}. \tag{9.82}$$

Estimating $E((\hat{\beta}, \hat{\beta}))$ by $(\hat{\beta}, \hat{\beta})$, $(\beta, \beta)$ by $(\hat{\beta}_R, \hat{\beta}_R)$ and $\sigma^2$ by $s^2$, they suggest choosing $\lambda$ so that

$$(\hat{\beta}_R, \hat{\beta}_R) = (\hat{\beta}, \hat{\beta}) - s^2 \sum_{i=1}^{m} \frac{1}{\lambda_i} \tag{9.83}$$

which will constrain $(\hat{\beta}_R, \hat{\beta}_R)$ to be smaller than $(\hat{\beta}, \hat{\beta})$, provided that (9.83) has a
solution. If (9.83) is not solvable, i.e., the right hand side is negative, then they considered
using $\lambda = 0$ (i.e., the least squares estimator) or $\lambda = \infty$ (i.e., $\hat{\beta}_R = 0$). In their simulations
[79] neither method proved better (in terms of MSE) than least squares in all cases.


(6) PRESS($\lambda$)  If improvement in prediction rather than fit is the primary goal of one's
analysis, then methods which estimate $\lambda$ based on prediction performance statistics can
be used. A straightforward approach along these lines is to use a generalization of the
PRESS statistic discussed in Chapter 8.
Letting $\hat{\beta}_{R(-i)}(\lambda)$ denote the ridge estimate of $\beta$ with the $i$-th observation deleted
and $\hat{y}_{(-i)}(\lambda) = (\hat{\beta}_{R(-i)}(\lambda), x_i)$ the corresponding predicted value at $x_i$, then we define
PRESS($\lambda$) by

$$\mathrm{PRESS}(\lambda) = \sum_{i=1}^{n} [y_i - \hat{y}_{(-i)}(\lambda)]^2 \tag{9.84}$$

and note that PRESS(0) = PRESS. Choosing $\lambda$ to minimize PRESS($\lambda$) then yields a
ridge estimator, which hopefully has "better" prediction properties than $\hat{\beta}$. (However,
keep in mind our comments concerning prediction in Section 9.5.)

Unfortunately, if $\lambda > 0$, no simple expression for PRESS($\lambda$) exists analogous to that
for PRESS(0). Thus a large amount of computation is necessary to evaluate this statistic
and to search for the minimizing value of $\lambda$. An approximation to PRESS($\lambda$) which is
cheaper to compute may be found in [90].
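The brute-force evaluation of (9.84) can be sketched as below. This is an illustration under assumptions: the design matrix is used as given (no re-centering or re-scaling after each deletion), the data are synthetic, and the function name `press_ridge` is hypothetical.

```python
import numpy as np


def press_ridge(X, y, lam):
    """PRESS(lam) of (9.84) by explicit leave-one-out refitting."""
    n, m = X.shape
    total = 0.0
    for i in range(n):
        keep = np.arange(n) != i                 # drop the i-th observation
        Xi, yi = X[keep], y[keep]
        beta = np.linalg.solve(Xi.T @ Xi + lam * np.eye(m), Xi.T @ yi)
        total += (y[i] - X[i] @ beta) ** 2       # squared prediction error at x_i
    return total


# Synthetic example data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=20)    # induce collinearity
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=20)

grid = [0.0, 0.01, 0.1, 1.0, 10.0]
values = [press_ridge(X, y, lam) for lam in grid]
best = grid[int(np.argmin(values))]
print(dict(zip(grid, np.round(values, 3))), best)
```

Each evaluation refits the model $n$ times, which is the computational burden the text refers to; for $\lambda = 0$ the result agrees with the usual shortcut $\sum_i [e_i/(1 - h_{ii})]^2$.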


(7) $C_\lambda$ Statistic  Using a similar prediction optimization philosophy Mallows in [82]
introduced a generalization of the $C_p$ statistic as a tool for estimating $\lambda$. He proposed
plotting

$$C_\lambda = \frac{\mathrm{SSE}(\lambda)}{\hat{\sigma}^2} - n + 2 + 2\,\mathrm{tr}(XL) \tag{9.85}$$

where $L = (X^T X + \lambda I_m)^{-1} X^T$, against
$$V_\lambda = 1 + \mathrm{tr}(X^T X L L^T) \tag{9.86}$$
and choosing $\lambda$ to minimize $C_\lambda$.


(8) Generalized Cross Validation Estimator  Perhaps the most popular prediction
based statistic for choosing $\lambda$ is Wahba, Golub and Heath's generalized cross validation
(GCV) statistic [116] which they motivate as a rotation invariant version of PRESS($\lambda$).
It is defined by

$$\mathrm{GCV}(\lambda) = \frac{\| (I_n - H(\lambda)) y \|^2}{[\mathrm{tr}(I_n - H(\lambda))]^2} \tag{9.87}$$

where $H(\lambda) = X (X^T X + \lambda I_m)^{-1} X^T$ is the ridge generalization of the hat matrix since
$\hat{Y}(\lambda) = H(\lambda) y$ gives the vector of predicted ridge estimates of $Y$ at the observation
points $x_i$, $1 \le i \le n$.

If $\lambda = 0$, then

$$\mathrm{GCV}(0) = \frac{\| (I_n - H) y \|^2}{[\mathrm{tr}(I_n - H)]^2} = \frac{\mathrm{SSE}}{(n - \sum_{i=1}^{n} h_{ii})^2} = \frac{\mathrm{SSE}}{(n - m)^2} \tag{9.88}$$

where $h_{ii}$ is the $i$-th diagonal element of $X (X^T X)^{-1} X^T$. In this case we notice
that GCV(0) $\ne$ PRESS(0) and for $\lambda = 0$, GCV is actually a measure of fit rather than
of prediction.

As for PRESS($\lambda$), $\lambda$ is chosen to minimize GCV($\lambda$). This statistic has fared well in
simulations [39] and has been widely used in nonstatistical applications of ridge type
procedures.
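A grid search over (9.87) takes only a few lines. The sketch below assumes the form of GCV given in (9.87), without the $1/n$ normalization used in some references, and uses synthetic data; the function name `gcv` is hypothetical.

```python
import numpy as np


def gcv(X, y, lam):
    """GCV(lam) = ||(I - H(lam)) y||^2 / [tr(I - H(lam))]^2, as in (9.87)."""
    n, m = X.shape
    H = X @ np.linalg.solve(X.T @ X + lam * np.eye(m), X.T)  # ridge hat matrix
    resid = y - H @ y
    return (resid @ resid) / np.trace(np.eye(n) - H) ** 2


# Synthetic example data (illustrative only).
rng = np.random.default_rng(2)
X = rng.normal(size=(25, 4))
X[:, 3] = X[:, 0] + X[:, 1] + 0.05 * rng.normal(size=25)  # near dependency
y = X @ np.array([1.0, 1.0, -1.0, 0.5]) + rng.normal(size=25)

grid = np.concatenate(([0.0], np.logspace(-3, 2, 30)))
scores = np.array([gcv(X, y, lam) for lam in grid])
lam_gcv = grid[np.argmin(scores)]
print(lam_gcv)
```

Unlike PRESS($\lambda$), each evaluation needs only one fit, which partly explains the popularity of the statistic.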
Nonstochastic Estimators


Nonstochastic estimators of $\lambda$ appear to be less used in practice. As with the ridge trace,
they are somewhat subjective. We mention only a few.


(1) Reduced VIFs  If one examines the formula for $V(\hat{\beta}_R)$ it is easy to show that
the variance inflation factors of the ridge estimators $\hat{\beta}_R(\lambda)$ are decreasing functions of
$\lambda$. If improved estimation is one's goal, then a simple procedure is to choose $\lambda$ so that all
the VIFs are between one and ten. The smallest value of $\lambda$ which does this is preferable,
for then the bias is minimized as well (see [8]).


(2) tr$[H(\lambda)]$  In [90] Myers proposes examining tr$[H(\lambda)]$, which may be regarded as the
effective number of regression degrees of freedom in the problem. From the expressions
$\lambda_i / (\lambda_i + \lambda)$, $1 \le i \le m$, for the nonzero eigenvalues of $H(\lambda)$ (the diagonal elements of its canonical form),

$$\mathrm{DF}(\lambda) = \mathrm{tr}[H(\lambda)] = \sum_{i=1}^{m} \frac{\lambda_i}{\lambda_i + \lambda} \tag{9.89}$$

one sees that DF($\lambda$) is a decreasing function of $\lambda$ with DF(0) = $m$. As with the ridge
trace, one can generally expect a rapid fall off of DF($\lambda$) in a plot of DF($\lambda$) against $\lambda$ and
then stabilization.

The ridge parameter $\lambda$ is chosen as the point where this stabilization occurs. The
value of DF($\lambda$) where this occurs is referred to as the effective rank of $X_s$ [112].
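The quantity DF($\lambda$) in (9.89) can be computed either from the eigenvalues of $X^T X$ or directly as a trace. The sketch below, on synthetic data and with the hypothetical function name `effective_df`, shows the eigenvalue route and checks it against the trace.

```python
import numpy as np


def effective_df(X, lam):
    """DF(lam) = sum_i lambda_i / (lambda_i + lam), as in (9.89)."""
    eigvals = np.linalg.eigvalsh(X.T @ X)   # eigenvalues of X'X
    return float(np.sum(eigvals / (eigvals + lam)))


# Synthetic example data (illustrative only).
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 4))
X[:, 2] = X[:, 0] - X[:, 1] + 0.02 * rng.normal(size=30)  # near dependency

lams = [0.0, 0.01, 0.1, 1.0, 10.0]
dfs = [effective_df(X, lam) for lam in lams]
print(np.round(dfs, 3))

# Direct trace of the ridge hat matrix for comparison at one value of lam.
lam = 0.7
H = X @ np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T)
print(np.trace(H), effective_df(X, lam))
```

With a near dependency present, DF($\lambda$) typically drops by about one unit for quite small $\lambda$ and then levels off, which is the stabilization the text describes.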


9.5.4 Generalized Ridge Regression 


In ridge regression one is attempting to control the stability of the coefficient estimates
through the use of a single parameter. Since there is no reason to believe that a single
value of $\lambda$ can stabilize all the estimates simultaneously, it is perhaps more reasonable
to try to use $m$ parameters to control each coefficient separately. This idea leads to the
notion of generalized ridge regression which we discuss next.

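We begin by examining the effect of ridge regression on the canonical form of the
model and this will lead to a straightforward generalization. From (9.41)

$$\hat{\beta}_R = (X^T X + \lambda I_m)^{-1} X^T X \hat{\beta} \tag{9.90}$$

and using the spectral decomposition of $X^T X$ this gives (using $Q^T = Q^{-1}$)

$$\hat{\beta}_R = (Q \Lambda Q^T + \lambda I_m)^{-1} Q \Lambda Q^T \hat{\beta} = Q (\Lambda + \lambda I_m)^{-1} \Lambda Q^T \hat{\beta} \tag{9.91}$$

so that
$$Q^T \hat{\beta}_R = (\Lambda + \lambda I_m)^{-1} \Lambda Q^T \hat{\beta}. \tag{9.92}$$

If we let $\hat{\gamma}_R = Q^T \hat{\beta}_R$ and note that $Q^T \hat{\beta} = \hat{\gamma}$, the least squares estimator of $\gamma$ in the
canonical form of the linear model (recall Example 5.11), then (9.92) becomes

$$\hat{\gamma}_R = (\Lambda + \lambda I_m)^{-1} \Lambda \hat{\gamma} \tag{9.93}$$
so that
$$\hat{\gamma}_{R,i} = \frac{\lambda_i}{\lambda_i + \lambda} \hat{\gamma}_i. \tag{9.94}$$
Since $\mathrm{Var}(\hat{\gamma}_i) = \sigma^2 / \lambda_i$ and
$$\mathrm{Var}(\hat{\gamma}_{R,i}) = \frac{\lambda_i^2}{(\lambda_i + \lambda)^2} \cdot \frac{\sigma^2}{\lambda_i} = \frac{\lambda_i \sigma^2}{(\lambda_i + \lambda)^2}, \tag{9.95}$$

the effect of ridge regression is to reduce the variance of $\hat{\gamma}_i$ by a factor of $\lambda_i^2 / (\lambda_i + \lambda)^2$
which decreases to zero as $\lambda \to \infty$. Thus, ridge regression counteracts the effects of small
eigenvalues on the canonical estimates by multiplying the least squares estimators by a
filter of the form $\lambda_i / (\lambda_i + \lambda)$. Because there is generally no a priori reason to use a
common filter for each coefficient, we can consider replacing $\lambda$ by a different value $\mu_i$ for
each component and this leads one to define generalized ridge estimators for $\gamma$ by $\hat{\gamma}_{GR}$,
where
$$\hat{\gamma}_{GR,i} = \frac{\lambda_i \hat{\gamma}_i}{\lambda_i + \mu_i}. \tag{9.96}$$

The generalized ridge estimator for $\beta$ is then given by
$$\hat{\beta}_{GR} = Q \hat{\gamma}_{GR}. \tag{9.97}$$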


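As for ordinary ridge regression, we are now faced with the task of choosing $\mu_i$, $1 \le i \le m$.
As before, a reasonable approach is to determine these to minimize the mean
square error of $\hat{\gamma}_{GR,i}$, $1 \le i \le m$. Now,

$$\mathrm{MSE}(\hat{\gamma}_{GR,i}) = [E(\hat{\gamma}_{GR,i}) - \gamma_i]^2 + \mathrm{Var}(\hat{\gamma}_{GR,i}) \tag{9.98}$$
and from the definition of $\hat{\gamma}_{GR,i}$ we find that $E(\hat{\gamma}_{GR,i}) = \lambda_i \gamma_i / (\lambda_i + \mu_i)$ (since $\hat{\gamma}_i$ is unbiased) and
$\mathrm{Var}(\hat{\gamma}_{GR,i}) = \lambda_i \sigma^2 / (\lambda_i + \mu_i)^2$ so that

$$\mathrm{MSE}(\hat{\gamma}_{GR,i}) = \frac{\mu_i^2 \gamma_i^2}{(\lambda_i + \mu_i)^2} + \frac{\sigma^2 \lambda_i}{(\lambda_i + \mu_i)^2} = \frac{\mu_i^2 \gamma_i^2 + \sigma^2 \lambda_i}{(\lambda_i + \mu_i)^2}. \tag{9.99}$$
Using standard calculus arguments it can be shown that
$$\mu_i = \sigma^2 / \gamma_i^2 \tag{9.100}$$

minimizes (9.99). Since $\gamma_i$ and $\sigma^2$ are unknown, these are typically replaced by their
OLS estimates $\hat{\gamma}_i$ and $s^2$ leading to stochastic estimators of $\mu_i$:

$$\hat{\mu}_i = s^2 / \hat{\gamma}_i^2, \quad 1 \le i \le m. \tag{9.101}$$

These may then be used in place of $\mu_i$ in the definition of $\hat{\gamma}_{GR,i}$ and then $\beta$ is estimated by (9.97).
An alternative approach to estimating $\mu_i$ is to attempt to minimize $\mathrm{MSE}(\hat{\beta}_{GR})$.
However, using calculations similar to those in Theorem 9.1,

$$\mathrm{MSE}(\hat{\beta}_{GR}) = \sum_{i=1}^{m} \mathrm{MSE}(\hat{\gamma}_{GR,i}) \tag{9.102}$$
and minimizing (9.102) leads to the same values of $\mu_i$ given in (9.100). Thus both
approaches give the same result.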

As for ordinary ridge regression, iterative versions of the estimators $\hat{\mu}_i$, $1 \le i \le m$,
have been considered [58]. These are defined by the sequence

$$\hat{\mu}_{i,j} = \frac{s^2}{\hat{\gamma}_{i,j}^2}, \qquad \hat{\gamma}_{i,j+1} = \frac{\lambda_i \hat{\gamma}_{i,0}}{\lambda_i + \hat{\mu}_{i,j}}, \quad j = 0, 1, 2, \ldots \tag{9.103}$$

where $\hat{\gamma}_{i,0}$, $1 \le i \le m$, is the least squares estimator of $\gamma_i$. $\beta$ is then estimated by the
sequence

$$\hat{\beta}_{i,j} = (Q \hat{\gamma}_j)_i, \quad i = 1, 2, \ldots, m, \quad j = 0, 1, 2, \ldots \tag{9.104}$$

where $\hat{\gamma}_j = (\hat{\gamma}_{1,j}, \hat{\gamma}_{2,j}, \ldots, \hat{\gamma}_{m,j})^T$.

The iteration is terminated when the lengths of successive iterates $\hat{\gamma}_j$, $\hat{\gamma}_{j+1}$ are
approximately the same. In this regard, it is interesting to note that Hemmerle in [54] gave
a closed form solution for $\lim_{j \to \infty} \hat{\beta}_{i,j}$ so that carrying out the iterations in (9.103) is
actually unnecessary.

In particular, let $T$ be the $m \times m$ diagonal matrix diag$(\tau_1, \tau_2, \ldots, \tau_m)$ where

$$\tau_i = \begin{cases} 0, & \text{if } \hat{\gamma}_i^2 \lambda_i / s^2 < 4, \\ 1/2 + [1/4 - s^2 / (\hat{\gamma}_i^2 \lambda_i)]^{1/2}, & \text{otherwise,} \end{cases} \tag{9.105}$$
then
$$\lim_{j \to \infty} \hat{\gamma}_{i,j} = (T \hat{\gamma})_i \tag{9.106}$$
and so
$$\lim_{j \to \infty} \hat{\beta}_{i,j} = (Q T \hat{\gamma})_i. \tag{9.107}$$

If we observe that $(\hat{\gamma}_i^2 \lambda_i / s^2)^{1/2}$ is the t-statistic associated with the least squares
estimator $\hat{\gamma}_i$, then the fully iterated generalized ridge estimator (9.103) says to estimate
$\gamma_i$ by zero if the observed t statistic is not significant at about the 5% level, otherwise
$\gamma_i$ is estimated by the function of the least squares estimator given in (9.106). Thus,
"nonsignificant" least squares estimates are shrunk to zero, while the remaining ones are
shrunk less drastically.

It was observed by Hemmerle [54] that (9.103) often induced too much shrinkage
(bias) in the estimates of $\gamma_i$ and he and others have proposed various modifications of
this procedure. (Further details may be found in [55].) We also note that these estimators
were not evaluated in the simulation study of Gibbons [39].
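The relation between the iteration (9.103) and the closed form (9.105)-(9.106) can be illustrated numerically. The sketch below is an assumption-laden illustration, not Hemmerle's algorithm as published: it takes the iteration in the form $\hat{\gamma}_{i,j+1} = \lambda_i \hat{\gamma}_{i,0} / (\lambda_i + s^2/\hat{\gamma}_{i,j}^2)$, uses made-up canonical estimates, and the function names are hypothetical.

```python
import numpy as np


def hemmerle_limit(gamma_ls, eigvals, s2):
    """Closed-form limit T * gamma_hat of (9.105)-(9.106)."""
    t2 = gamma_ls ** 2 * eigvals / s2          # squared t-statistics
    tau = np.zeros_like(gamma_ls)
    significant = t2 >= 4.0
    tau[significant] = 0.5 + np.sqrt(0.25 - 1.0 / t2[significant])
    return tau * gamma_ls


def iterate_generalized_ridge(gamma_ls, eigvals, s2, n_iter=500):
    """Run the assumed form of (9.103) for a fixed number of steps."""
    g = gamma_ls.copy()
    for _ in range(n_iter):
        # Algebraically equal to eigvals * gamma_ls / (eigvals + s2 / g**2),
        # rewritten so that no division by zero occurs when g shrinks to 0.
        g = eigvals * gamma_ls * g ** 2 / (eigvals * g ** 2 + s2)
    return g


# Made-up canonical least squares estimates and eigenvalues.
gamma_ls = np.array([3.0, 1.0, -2.5])
eigvals = np.array([1.0, 1.0, 4.0])
s2 = 1.0

closed = hemmerle_limit(gamma_ls, eigvals, s2)
iterated = iterate_generalized_ridge(gamma_ls, eigvals, s2)
print(closed)
print(iterated)
```

The second component has a squared t-statistic below 4 and is driven to zero by the iteration, while the other two settle at the larger root of the fixed-point equation, in agreement with the closed form.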

As a further observation on generalized ridge regression we note, as stated in Section
9.5, that the harmonic mean estimator of $\lambda$ in ordinary ridge regression may be viewed as
the harmonic mean of the generalized ridge parameters $\hat{\mu}_i = s^2 / \hat{\gamma}_i^2$. For this, we observe
that by definition, the harmonic mean of $\hat{\mu}_i$, $1 \le i \le m$, is given by

$$\frac{m}{\sum_{i=1}^{m} (1/\hat{\mu}_i)} = \frac{m}{\sum_{i=1}^{m} (\hat{\gamma}_i^2 / s^2)} = \frac{m s^2}{\sum_{i=1}^{m} \hat{\gamma}_i^2} = \frac{m s^2}{(\hat{\gamma}, \hat{\gamma})} = \frac{m s^2}{(\hat{\beta}, \hat{\beta})}. \tag{9.108}$$
As a last comment on generalized ridge regression we observe that another choice of
parameters is often made. That is, if $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_m$ (this is the reverse order from
what we have been using), then one chooses $\mu_i = \infty$, $1 \le i \le r$, and $\mu_i = 0$, $r + 1 \le i \le m$,
where $\lambda_1, \lambda_2, \ldots, \lambda_r$ are the $r$ smallest eigenvalues of $X^T X$. Often, as in the study of Gunst
and Mason [46], $r = 1$.

This gives rise to a class of generalized ridge estimators called principal components
estimators. Denoting these estimators with the sub/superscript "pc" we have

$$\hat{\gamma}_i^{pc} = \begin{cases} 0, & i = 1, 2, \ldots, r, \\ \hat{\gamma}_i, & i = r + 1, r + 2, \ldots, m. \end{cases} \tag{9.109}$$

The corresponding estimator of $\beta$ is given by

$$\hat{\beta}_{pc} = Q \hat{\gamma}^{pc} \tag{9.110}$$
where $\hat{\gamma}^{pc} = (\hat{\gamma}_1^{pc}, \hat{\gamma}_2^{pc}, \ldots, \hat{\gamma}_m^{pc})^T$.
Further discussion of these estimators will be given in the next section.


9.6 Other Alternatives to OLS 


Ridge estimation is by no means the only alternative to OLS that has been proposed to
improve estimation and prediction when multicollinearity is present. Here we discuss two
further approaches, mixed estimation and principal components regression. A description
of other methods, such as latent root regression, may be found in [51].


9.6.1 Mixed Estimation 


This is a quasi-Bayesian approach for solving ill-conditioned problems, which uses a
priori constraints on $\beta$ as additional phoney data to improve the quality of estimation.

Suppose then that we have a standard linear model $Y = X\beta + \varepsilon$ and assume that we
have a set of $p < m$ prior constraints on $\beta$ which can be written in the form

$$z = D\beta + \delta \tag{9.111}$$

where for simplicity we have $E(\delta) = 0$ and $\Sigma(\delta) = \sigma_\delta^2 I$, and, in general, $\sigma_\delta^2 \ne \sigma^2$. Then
augmenting the basic model with this information we get

$$\begin{bmatrix} Y \\ z \end{bmatrix} = \begin{bmatrix} X \\ D \end{bmatrix} \beta + \begin{bmatrix} \varepsilon \\ \delta \end{bmatrix}. \tag{9.112}$$

If $\varepsilon$ and $\delta$ are assumed independent, then the mixed estimator $\hat{\beta}_M$ of $\beta$ is the generalized
least squares estimator obtained from (9.112) and is given by

$$\hat{\beta}_M = \left[ \frac{X^T X}{\sigma^2} + \frac{D^T D}{\sigma_\delta^2} \right]^{-1} \left[ \frac{X^T y}{\sigma^2} + \frac{D^T z}{\sigma_\delta^2} \right] \tag{9.113}$$

and observe that if $D^T D = I_m$ and $z = 0$ then $\hat{\beta}_M$ becomes the ridge estimator

$$\hat{\beta}_R = (X^T X + \lambda I_m)^{-1} X^T y \tag{9.114}$$

where $\lambda = \sigma^2 / \sigma_\delta^2$ and is identical to the normal theory Bayes estimator. Again this
result emphasizes that ridge estimation is OLS with constraints imposed on $\beta$ via prior
information.

A numerical example, taken from economic theory, may be found in BKW [8].
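The mixed estimator (9.113) and its reduction to the ridge estimator (9.114) can be checked with a short sketch. The data are synthetic, the two variances are treated as known, and the function name `mixed_estimator` is hypothetical.

```python
import numpy as np


def mixed_estimator(X, y, D, z, sigma2, sigma2_delta):
    """Mixed estimator (9.113): GLS on the model augmented with z = D beta + delta."""
    lhs = X.T @ X / sigma2 + D.T @ D / sigma2_delta
    rhs = X.T @ y / sigma2 + D.T @ z / sigma2_delta
    return np.linalg.solve(lhs, rhs)


# Synthetic example data (illustrative only).
rng = np.random.default_rng(4)
n, m = 30, 3
X = rng.normal(size=(n, m))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=n)   # near dependency
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)

sigma2, sigma2_delta = 1.0, 4.0

# Special case D = I_m, z = 0: the prior "data" pull every coefficient to zero.
beta_mixed = mixed_estimator(X, y, np.eye(m), np.zeros(m), sigma2, sigma2_delta)

# Ridge estimator (9.114) with lambda = sigma^2 / sigma_delta^2.
lam = sigma2 / sigma2_delta
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)
print(beta_mixed, beta_ridge)
```

In practice $\sigma^2$ and $\sigma_\delta^2$ are unknown and have to be estimated or specified from prior judgement, which is where the quasi-Bayesian flavour of the method enters.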


9.6.2 Principal Components Estimation 


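As we pointed out in our discussion of generalized ridge regression, the ridge parameters
can be chosen to make the coefficients $\hat{\gamma}_1, \hat{\gamma}_2, \ldots, \hat{\gamma}_r$ ($\hat{\gamma}_i$ is the least squares estimate of
$\gamma_i$ in the canonical model) corresponding to the $r$ smallest eigenvalues of $X^T X$ equal to
zero. Doing this leads to the principal components estimators

$$\hat{\beta}_{pc}(r) = Q \hat{\gamma}(r) \tag{9.115}$$

where
$$\hat{\gamma}(r) = (0, 0, \ldots, 0, \hat{\gamma}_{r+1}, \ldots, \hat{\gamma}_m)^T \tag{9.116}$$
where the eigenvalues of $X^T X$ have been ordered as $\lambda_1 \le \lambda_2 \le \cdots \le \lambda_m$ with the
columns of $Q$ being ordered in a corresponding fashion.
For $1 \le r \le m$, $\hat{\beta}_{pc}$ is generally biased with

$$E(\hat{\beta}_{pc}) = Q_2 \gamma_2 \tag{9.117}$$

where $\gamma$ is partitioned so that $\gamma = [\gamma_1^T \,|\, \gamma_2^T]^T$, where $\gamma_1 = (\gamma_1, \ldots, \gamma_r)^T$ and
$\gamma_2 = (\gamma_{r+1}, \ldots, \gamma_m)^T$, and $Q$ is partitioned as $[Q_1 | Q_2]$ to conform with this so that the columns
of $Q_2$ are the eigenvectors corresponding to $\lambda_{r+1}, \ldots, \lambda_m$.

Thus, the bias is given by

$$\text{Bias} = \beta - Q_2 \gamma_2 = Q_1 \gamma_1 \tag{9.118}$$

and $\Sigma[\hat{\beta}_{pc}(r)]$ is given by

$$\Sigma[\hat{\beta}_{pc}(r)] = Q_2 \Sigma(\hat{\gamma}_2) Q_2^T = \sigma^2 Q_2 \Lambda_2^{-1} Q_2^T \tag{9.119}$$

where $\Lambda_2 = \mathrm{diag}(\lambda_{r+1}, \ldots, \lambda_m)$.
From (9.117) and (9.118) or (9.119)

$$\mathrm{MSE}[\hat{\beta}_{pc}(r)] = \sum_{j=1}^{r} \gamma_j^2 + \sigma^2 \sum_{j=r+1}^{m} \frac{1}{\lambda_j}. \tag{9.120}$$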

In order to make PC estimation operational, a choice of r must be made which 
is analogous to selecting the ridge parameter(s) in ridge regression to minimize MSE. 
That this is theoretically possible comes from examining (9.120) where it is seen that 
the bias component increases as r increases, whereas the variance component decreases. 
However, since $\gamma_i$, $1 \le i \le m$, are unknown in practice, as for ridge regression, this choice
of $r$ cannot be made a priori. Because of this, a number of different selection philosophies
are currently used.
One approach is to use variable selection techniques, such as adding components
starting with $r = m - 1$ until some measure of fit such as $R^2$ stabilizes. Another
approach is to use backward elimination starting with the full model and eliminating $\gamma_i$'s
on the basis of the standard t-tests. This latter approach was criticized by Gunst and
Mason [46] as a consequence of the poor showing of PC estimators this procedure gave
in the simulation study of Dempster, Schatzoff and Wermuth in [26]. They use (but do not
clearly recommend) the PC estimator with $r = 1$. That is, they eliminate the component
corresponding to the smallest eigenvalue $\lambda_1$. This seems sensible since it can be shown
that the principal components estimator $\hat{\beta}_{pc}(r)$ is the least squares estimator of $\beta$
constrained by setting the dependencies determined by the $r$ smallest eigenvalues to zero.
So, $\hat{\beta}_{pc}(1)$ is determined by minimizing $(y - X\beta, y - X\beta)$ subject to

$$\sum_{i=1}^{m} q_i \beta_i = 0 \tag{9.121}$$

where $(q_1, q_2, \ldots, q_m)^T$ is the eigenvector for $\lambda_{\min}(X^T X)$.

In their simulation studies they found, as measured by MSE, that the PC(1) estimator
generally outperformed ordinary least squares and ridge regression with the harmonic
mean estimator (9.75) for the ridge parameter, for ill-conditioned problems, but was
worse than least squares for nearly orthogonal data.

Further details may be found in [26]. However, because their simulations were all
performed with $(\beta, \beta) = 1$ the same criticism as given by Draper and Van Nostrand [28]
for ridge regression simulations holds.
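The principal components estimator (9.115)-(9.116) and the constraint (9.121) can be sketched as follows. The data are synthetic and the function name `pc_estimator` is hypothetical; the code relies on `numpy.linalg.eigh` returning eigenvalues in ascending order, which matches the ordering $\lambda_1 \le \cdots \le \lambda_m$ used in this section.

```python
import numpy as np


def pc_estimator(X, y, r):
    """Set the canonical coefficients of the r smallest eigenvalues to zero."""
    eigvals, Q = np.linalg.eigh(X.T @ X)         # ascending eigenvalues
    gamma_ls = (Q.T @ (X.T @ y)) / eigvals       # canonical least squares estimates
    gamma_pc = gamma_ls.copy()
    gamma_pc[:r] = 0.0                           # (9.116)
    return Q @ gamma_pc, Q                       # (9.115)


# Synthetic example data (illustrative only).
rng = np.random.default_rng(5)
n = 40
X = rng.normal(size=(n, 3))
X[:, 2] = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=n)  # near dependency
y = X @ np.array([1.0, 1.0, 1.0]) + rng.normal(size=n)

beta_ols, Q = pc_estimator(X, y, 0)   # r = 0 reproduces least squares
beta_pc1, _ = pc_estimator(X, y, 1)   # drop the smallest-eigenvalue component
print(beta_ols)
print(beta_pc1)
print(Q[:, 0] @ beta_pc1)             # the constraint (9.121), numerically zero
```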


9.7 Exercises 


9.1 Is ridge regression a least squares procedure? 


9.2 Consider the housing price data in Example 5.12.
(a) Find the sample correlations between the regressor variables.
(b) What are the variance inflation factors?
(c) Find the condition number of $X^T X$. Give a comment.


9.3 Using the birth weight data in Example 5.15, answer the following.
(a) Find the sample correlation matrix among the regressor variables.
(b) Calculate the variance inflation factors.
(c) Find the eigenvalues of $X^T X$.
(d) Find the condition number of $X^T X$. Give a comment.


9.4 Consider the Longley data in Example 5.17.
(a) Find the sample correlation matrix among the regressor variables.
(b) Using the numbers in (a), give a brief comment about the indication of multicollinearity.
(c) What are the variance inflation factors?
(d) Find the eigenvalues and eigenvectors of $X^T X$.
(e) Find the condition number of $X^T X$. Give a comment on the evidence of multicollinearity.
(f) Find the ridge regression solution for the data.


9.5 A researcher obtained the following data from an experiment.

  Obs. No.    Y    $x_1$   $x_2$
     1       39     0     0
     2       44     0     0
     3       45     0     0
     4       42     1    -1
     5       38    -1     1
     6       35    -1    -1
     7       37     1     1

(a) Fit the data to the model $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{11} x_1^2 + \beta_{22} x_2^2 + \beta_{12} x_1 x_2 + \varepsilon$.
(b) Can you suggest a different model(s) to fit the data?


9.6 Consider the data below for the model $Y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \varepsilon$.

  Obs. No.           Y    $x_1$   $x_2$
     1              10    -2    -4
     2               6    -1    -1
     3               6     0     0
     4              12     1     1
     5              14     2     4
  Sum               48     0     0
  Sum of Squares   512    10    34

(a) Center and scale the $x_1$, $x_2$ and $Y$ columns.
(b) Write out the normal equations for (a) and solve them.
(c) Find the determinant of the correlation matrix.


9.7 Prove Theorem 9.1 (vi).


9.8 Prove that the ridge estimator is the solution to the problem
$$\min_\beta \; (\beta - \hat{\beta})^T X^T X (\beta - \hat{\beta}) \quad \text{subject to } \beta^T \beta \le d^2.$$


9.9 Stein [109] proposed the pure shrinkage estimator defined as $\hat{\beta}_s = c\hat{\beta}$, where
$0 \le c \le 1$, and $c$ is a constant chosen by the experimenter. Show that the pure
shrinkage estimator is the solution to the problem
$$\min_\beta \; (\beta - \hat{\beta})^T (\beta - \hat{\beta}) \quad \text{subject to } \beta^T \beta \le d^2.$$
9.10 Consider the data set given below.

     y     $x_1$     $x_2$      $x_3$      $x_4$
   11.7   10.0   12125    132.2    404.6
   17.9   11.5   36717    501.5   1180.6
   21.1   11.6   43319    904.0   1807.5
   14.7   11.2   10580    227.6    470.0
    7.7   10.7    3931     66.6    151.4
    8.4   10.0    1536     43.4     93.8
   32.8    6.8   61400   1253.0   3293.4
   17.6    8.8    2589     83.1    158.2
   10.9    8.5    1186     24.2     96.2
    9.2    7.7     291      4.5     31.8
   16.2    4.9    1276      9.1     95.0
   10.1    9.6    6633    158.2    407.2

(a) Find the eigenvalues and eigenvectors of the scaled $X$ matrix (not centered).
(b) What is the condition number of $X^T X$? Give a comment on the evidence of
multicollinearity.
(c) Find the variance inflation factors for the regression coefficients.
(d) Draw the ridge trace for the data and find the ridge regression solution.
(e) Do you have any suggestions to alleviate the multicollinearity?
APPENDIX


Table A.1 Standard Normal Distribution

$$P(Z \le z) = \Phi(z) = \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}} \exp(-x^2/2)\, dx, \qquad \Phi(-z) = 1 - \Phi(z)$$

Note: Entry represents the area under the standard normal distribution from $-\infty$ to $z$.

                          Second Decimal Place of z
  z     .00    .01    .02    .03    .04    .05    .06    .07    .08    .09
 0.0   .5000  .5040  .5080  .5120  .5160  .5199  .5239  .5279  .5319  .5359
 0.1   .5398  .5438  .5478  .5517  .5557  .5596  .5636  .5675  .5714  .5753
 0.2   .5793  .5832  .5871  .5910  .5948  .5987  .6026  .6064  .6103  .6141
 0.3   .6179  .6217  .6255  .6293  .6331  .6368  .6406  .6443  .6480  .6517
 0.4   .6554  .6591  .6628  .6664  .6700  .6736  .6772  .6808  .6844  .6879

 0.5   .6915  .6950  .6985  .7019  .7054  .7088  .7123  .7157  .7190  .7224
 0.6   .7257  .7291  .7324  .7357  .7389  .7422  .7454  .7486  .7517  .7549
 0.7   .7580  .7611  .7642  .7673  .7704  .7734  .7764  .7794  .7823  .7852
 0.8   .7881  .7910  .7939  .7967  .7995  .8023  .8051  .8078  .8106  .8133
 0.9   .8159  .8186  .8212  .8238  .8264  .8289  .8315  .8340  .8365  .8389

 1.0   .8413  .8438  .8461  .8485  .8508  .8531  .8554  .8577  .8599  .8621
 1.1   .8643  .8665  .8686  .8708  .8729  .8749  .8770  .8790  .8810  .8830
 1.2   .8849  .8869  .8888  .8907  .8925  .8944  .8962  .8980  .8997  .9015
 1.3   .9032  .9049  .9066  .9082  .9099  .9115  .9131  .9147  .9162  .9177
 1.4   .9192  .9207  .9222  .9236  .9251  .9265  .9279  .9292  .9306  .9319

 1.5   .9332  .9345  .9357  .9370  .9382  .9394  .9406  .9418  .9430  .9441
 1.6   .9452  .9463  .9474  .9484  .9495  .9505  .9515  .9525  .9535  .9545
 1.7   .9554  .9564  .9573  .9582  .9591  .9599  .9608  .9616  .9625  .9633
 1.8   .9641  .9649  .9656  .9664  .9671  .9678  .9686  .9693  .9700  .9706
 1.9   .9713  .9719  .9726  .9732  .9738  .9744  .9750  .9756  .9762  .9767

 2.0   .9772  .9778  .9783  .9788  .9793  .9798  .9803  .9808  .9812  .9817
 2.1   .9821  .9826  .9830  .9834  .9838  .9842  .9846  .9850  .9854  .9857
 2.2   .9861  .9864  .9868  .9871  .9874  .9878  .9881  .9884  .9887  .9890
 2.3   .9893  .9896  .9898  .9901  .9904  .9906  .9909  .9911  .9913  .9916
 2.4   .9918  .9920  .9922  .9925  .9927  .9929  .9931  .9932  .9934  .9936

 2.5   .9938  .9940  .9941  .9943  .9945  .9946  .9948  .9949  .9951  .9952
 2.6   .9953  .9955  .9956  .9957  .9959  .9960  .9961  .9962  .9963  .9964
 2.7   .9965  .9966  .9967  .9968  .9969  .9970  .9971  .9972  .9973  .9974
 2.8   .9974  .9975  .9976  .9977  .9977  .9978  .9979  .9979  .9980  .9981
 2.9   .9981  .9982  .9982  .9983  .9984  .9984  .9985  .9985  .9986  .9986

 3.0   .9987  .9987  .9987  .9988  .9988  .9989  .9989  .9989  .9990  .9990
 3.1   .9990  .9991  .9991  .9991  .9992  .9992  .9992  .9992  .9993  .9993
 3.2   .9993  .9993  .9994  .9994  .9994  .9994  .9994  .9995  .9995  .9995
 3.3   .9995  .9995  .9995  .9996  .9996  .9996  .9996  .9996  .9996  .9997
 3.4   .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9997  .9998

Table A.2 Percentiles of the Student's t-Distribution

$$P(T \le t) = \int_{-\infty}^{t} \frac{\Gamma[(r+1)/2]}{\sqrt{\pi r}\, \Gamma(r/2)\, (1 + x^2/r)^{(r+1)/2}}\, dx, \qquad P(T \le -t) = 1 - P(T \le t)$$

Note: Entry is the value of $t_p$. For lower percentiles, use the relation $t_p = -t_{1-p}$. In particular,
$t_{.50} = -t_{.50} = 0$. For example, for degrees of freedom $r = 8$, $t_{.95} = -t_{.05} = 1.860$.

 d.f.                               $P(T \le t_p) = p$
  r     0.75    0.90    0.95    0.975    0.99    0.995   0.9975    0.999   0.9995
   1   1.000   3.078   6.314   12.706  31.821   63.657  127.322  318.309  636.619
   2    .816   1.886   2.920    4.303   6.965    9.925   14.089   22.327   31.598
   3    .765   1.638   2.353    3.182   4.541    5.841    7.453   10.214   12.924
   4    .741   1.533   2.132    2.776   3.747    4.604    5.598    7.173    8.610
   5    .727   1.476   2.015    2.571   3.365    4.032    4.773    5.893    6.869

   6    .718   1.440   1.943    2.447   3.143    3.707    4.317    5.208    5.959
   7    .711   1.415   1.895    2.365   2.998    3.499    4.029    4.785    5.408
   8    .706   1.397   1.860    2.306   2.896    3.355    3.833    4.501    5.041
   9    .703   1.383   1.833    2.262   2.821    3.250    3.690    4.297    4.781
  10    .700   1.372   1.812    2.228   2.764    3.169    3.581    4.144    4.587

  11    .697   1.363   1.796    2.201   2.718    3.106    3.497    4.025    4.437
  12    .695   1.356   1.782    2.179   2.681    3.055    3.428    3.930    4.318
  13    .694   1.350   1.771    2.160   2.650    3.012    3.372    3.852    4.221
  14    .692   1.345   1.761    2.145   2.624    2.977    3.326    3.787    4.140
  15    .691   1.341   1.753    2.131   2.602    2.947    3.286    3.733    4.073

  16    .690   1.337   1.746    2.120   2.583    2.921    3.252    3.686    4.015
  17    .689   1.333   1.740    2.110   2.567    2.898    3.222    3.646    3.965
  18    .688   1.330   1.734    2.101   2.552    2.878    3.197    3.610    3.922
  19    .688   1.328   1.729    2.093   2.539    2.861    3.174    3.579    3.883
  20    .687   1.325   1.725    2.086   2.528    2.845    3.153    3.552    3.850

  21    .686   1.323   1.721    2.080   2.518    2.831    3.135    3.527    3.819
  22    .686   1.321   1.717    2.074   2.508    2.819    3.119    3.505    3.792
  23    .685   1.319   1.714    2.069   2.500    2.807    3.104    3.485    3.767
  24    .685   1.318   1.711    2.064   2.492    2.797    3.091    3.467    3.745
  25    .684   1.316   1.708    2.060   2.485    2.787    3.078    3.450    3.725

  26    .684   1.315   1.706    2.056   2.479    2.779    3.067    3.435    3.707
  27    .684   1.314   1.703    2.052   2.473    2.771    3.057    3.421    3.690
  28    .683   1.313   1.701    2.048   2.467    2.763    3.047    3.408    3.674
  29    .683   1.311   1.699    2.045   2.462    2.756    3.038    3.396    3.659
  30    .683   1.310   1.697    2.042   2.457    2.750    3.030    3.385    3.646

  40    .681   1.303   1.684    2.021   2.423    2.704    2.971    3.307    3.551
  50    .680   1.299   1.676    2.009   2.403    2.678    2.938    3.261    3.496
  60    .679   1.296   1.671    2.000   2.390    2.660    2.915    3.232    3.460
 120    .677   1.289   1.658    1.980   2.358    2.617    2.860    3.160    3.373
 $\infty$    .674   1.282   1.645    1.960   2.326    2.576    2.807    3.090    3.291
Table A.3 Percentiles of the Chi-Square Distribution

$$P(X \le x) = \int_{0}^{x} \frac{1}{\Gamma(r/2)\, 2^{r/2}}\, u^{r/2 - 1} e^{-u/2}\, du, \qquad P(X \le \chi^2_p) = p$$

Note: Entry is the value of $\chi^2_p$.

 d.f.                               $P(X \le \chi^2_p) = p$
  r       0.005      0.025      0.05      0.90      0.95     0.975      0.99     0.995
   1   0.0000393   0.000982   0.00393    2.706     3.841     5.024     6.635     7.879
   2     .0100      .0506      .103      4.605     5.991     7.378     9.210    10.597
   3     .0717      .216       .352      6.251     7.815     9.348    11.345    12.838
   4     .207       .484       .711      7.779     9.488    11.143    13.277    14.860
   5     .412       .831      1.145      9.236    11.070    12.832    15.086    16.750
   6     .676      1.237      1.635     10.645    12.592    14.449    16.812    18.548
   7     .989      1.690      2.167     12.017    14.067    16.013    18.475    20.278
   8    1.344      2.180      2.733     13.362    15.507    17.535    20.090    21.955
   9    1.735      2.700      3.325     14.684    16.919    19.023    21.666    23.589
  10    2.156      3.247      3.940     15.987    18.307    20.483    23.209    25.188
  11    2.603      3.816      4.575     17.275    19.675    21.920    24.725    26.757
  12    3.074      4.404      5.226     18.549    21.026    23.337    26.217    28.300
  13    3.565      5.009      5.892     19.812    22.362    24.736    27.688    29.819
  14    4.075      5.629      6.571     21.064    23.685    26.119    29.141    31.319
  15    4.601      6.262      7.261     22.307    24.996    27.488    30.578    32.801
  16    5.142      6.908      7.962     23.542    26.296    28.845    32.000    34.267
  17    5.697      7.564      8.672     24.769    27.587    30.191    33.409    35.718
  18    6.265      8.231      9.390     25.989    28.869    31.526    34.805    37.156
  19    6.844      8.907     10.117     27.204    30.144    32.852    36.191    38.582
  20    7.434      9.591     10.851     28.412    31.410    34.170    37.566    39.997
  21    8.034     10.283     11.591     29.615    32.671    35.479    38.932    41.401
  22    8.643     10.982     12.338     30.813    33.924    36.781    40.289    42.796
  23    9.260     11.689     13.091     32.007    35.172    38.076    41.638    44.181
  24    9.886     12.401     13.848     33.196    36.415    39.364    42.980    45.558

  25   10.520     13.120     14.611     34.382    37.652    40.646    44.314    46.928
  26   11.160     13.844     15.379     35.563    38.885    41.923    45.642    48.290
  27   11.808     14.573     16.151     36.741    40.113    43.194    46.963    49.645
  28   12.461     15.308     16.928     37.916    41.337    44.461    48.278    50.993
  29   13.121     16.047     17.708     39.087    42.557    45.722    49.588    52.336
  30   13.787     16.791     18.493     40.256    43.773    46.979    50.892    53.672

  35   17.192     20.569     22.465     46.059    49.802    53.203    57.342    60.275
  40   20.707     24.433     26.509     51.805    55.758    59.342    63.691    66.766
  50   27.991     32.357     34.764     63.167    67.505    71.420    76.154    79.490
  60   35.535     40.482     43.188     74.397    79.082    83.298    88.379    91.952
  70   43.275     48.758     51.739     85.527    90.531    95.023   100.425   104.215
  80   51.172     57.153     60.391     96.578   101.879   106.629   112.329   116.321
  90   59.196     65.647     69.126    107.565   113.145   118.136   124.116   128.299
 100   67.328     74.222     77.929    118.498   124.342   129.561   135.807   140.169


Source: [73] Abridged from Table 9 of Kokoska, S. and Nevison, C. (1989), Statistical Tables and
Formulae, Springer-Verlag, New York.


Table A.4a  F-Distribution - Critical values of upper 10% points 


$$P(X \le f) = \int_0^f \frac{\Gamma[(\nu_1+\nu_2)/2]\,(\nu_1/\nu_2)^{\nu_1/2}\,u^{\nu_1/2-1}}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)\,(1+\nu_1 u/\nu_2)^{(\nu_1+\nu_2)/2}}\,du,$$

$$P(F > f_{\nu_1,\nu_2;1-\alpha}) = \alpha = .10$$

Note: $F = s_1^2/s_2^2 = [S_1/\nu_1]/[S_2/\nu_2]$, where $s_1^2$ and $s_2^2$ are independent mean squares estimating a
common variance $\sigma^2$ and based on $\nu_1$ and $\nu_2$ degrees of freedom, respectively. 
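
As a small illustration of how the note and the table fit together, the sketch below forms the ratio of two mean squares and compares it with the upper 10% point listed in Table A.4a for 3 and 10 degrees of freedom (2.73). The function name `f_ratio` and the sums of squares are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def f_ratio(ss1, df1, ss2, df2):
    """Return F = (S1/v1) / (S2/v2), the ratio of two mean squares."""
    if df1 <= 0 or df2 <= 0:
        raise ValueError("degrees of freedom must be positive")
    ms1 = ss1 / df1  # mean square based on v1 degrees of freedom
    ms2 = ss2 / df2  # mean square based on v2 degrees of freedom
    return ms1 / ms2


# Hypothetical sums of squares, not taken from the text.
S1, v1 = 45.0, 3
S2, v2 = 40.0, 10

F = f_ratio(S1, v1, S2, v2)   # (45/3) / (40/10) = 15 / 4 = 3.75
critical_10pct = 2.73         # tabulated upper 10% point for v1 = 3, v2 = 10
significant = F > critical_10pct

print(f"F = {F:.2f}, exceeds the 10% critical value: {significant}")
```

With these made-up numbers the observed ratio lies above the tabulated point, so equality of the two variances would be rejected at the 10% level.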


d.f.                                ν1 = Degrees of Freedom for Numerator
  ν2      1      2      3      4      5      6      7      8      9     10     12     15     20     30
   1                 53.6   55.8   57.2   58.2   58.9   59.4   59.9   60.2   60.7   61.2   61.7   62.3
   2                 9.16   9.24   9.29   9.33   9.35   9.37   9.38   9.39   9.41   9.42   9.44   9.46
   3                 5.39   5.34   5.31   5.28   5.27   5.25   5.24   5.23   5.22   5.20   5.18   5.17
   4                 4.19   4.11   4.05   4.01   3.98   3.95   3.94   3.92   3.90   3.87   3.84   3.82
   5                 3.62   3.52   3.45   3.40   3.37   3.34   3.32   3.30   3.27   3.24   3.21   3.17
   6                 3.29   3.18   3.11   3.05   3.01   2.98   2.96   2.94   2.90   2.87   2.84   2.80
   7                 3.07   2.96   2.88   2.83   2.78   2.75   2.72   2.70   2.67   2.63   2.59   2.56
   8                 2.92   2.81   2.73   2.67   2.62   2.59   2.56   2.54   2.50   2.46   2.42   2.38
   9                 2.81   2.69   2.61   2.55   2.51   2.47   2.44   2.42   2.38   2.34   2.30   2.25
  10                 2.73   2.61   2.52   2.46   2.41   2.38   2.35   2.32   2.28   2.24   2.20   2.16
  11                 2.66   2.54   2.45   2.39   2.34   2.30   2.27   2.25   2.21   2.17   2.12   2.08
  12                 2.61   2.48   2.39   2.33   2.28   2.24   2.21   2.19   2.15   2.10   2.06   2.01
  13                 2.56   2.43   2.35   2.28   2.23   2.20   2.16   2.14   2.10   2.05   2.01   1.96
  14                 2.52   2.39   2.31   2.24   2.19   2.15   2.12   2.10   2.05   2.01   1.96   1.91
  15                 2.49   2.36   2.27   2.21   2.16   2.12   2.09   2.06   2.02   1.97   1.92   1.87
  16                 2.46   2.33   2.24   2.18   2.13   2.09   2.06   2.03   1.99   1.94   1.89   1.84
  17                 2.44   2.31   2.22   2.15   2.10   2.06   2.03   2.00   1.96   1.91   1.86   1.81
  18   3.01   2.62   2.42   2.29   2.20   2.13   2.08   2.04   2.00   1.98   1.93   1.89   1.84   1.78
  19   2.99   2.61   2.40   2.27   2.18   2.11   2.06   2.02   1.98   1.96   1.91   1.86   1.81   1.76
  20   2.97   2.59   2.38   2.25   2.16   2.09   2.04   2.00   1.96   1.94   1.89   1.84   1.79   1.74
  21   2.96   2.57   2.36   2.23   2.14   2.08   2.02   1.98   1.95   1.92   1.87   1.83   1.78   1.72
  22   2.95   2.56   2.35   2.22   2.13   2.06   2.01   1.97   1.93   1.90   1.86   1.81   1.76   1.70
  23   2.94   2.55   2.34   2.21   2.11   2.05   1.99   1.95   1.92   1.89   1.84   1.80   1.74   1.69
  24   2.93   2.54   2.33   2.19   2.10   2.04   1.98   1.94   1.91   1.88   1.83   1.78   1.73   1.67
  25   2.92   2.53   2.32   2.18   2.09   2.02   1.97   1.93   1.89   1.87   1.82   1.77   1.72   1.66
  26   2.91   2.52   2.31   2.17   2.08   2.01   1.96   1.92   1.88   1.86   1.81   1.76   1.71   1.65
  27   2.90   2.51   2.30   2.17   2.07   2.00   1.95   1.91   1.87   1.85   1.80   1.75   1.70   1.64
  28   2.89   2.50   2.29   2.16   2.06   2.00   1.94   1.90   1.87   1.84   1.79   1.74   1.69   1.63
  29   2.89   2.50   2.28   2.15   2.06   1.99   1.93   1.89   1.86   1.83   1.78   1.73   1.68   1.62
  30   2.88   2.49   2.28   2.14   2.05   1.98   1.93   1.88   1.85   1.82   1.77   1.72   1.67   1.61
  40   2.84   2.44   2.23   2.09   2.00   1.93   1.87   1.83   1.79   1.76   1.71   1.66   1.61   1.54
  60   2.79   2.39   2.18   2.04   1.95   1.87   1.82   1.77   1.74   1.71   1.66   1.60   1.54   1.48
 120   2.75   2.35   2.13   1.99   1.90   1.82   1.77   1.72   1.68   1.65   1.60   1.55   1.48   1.41
   ∞   2.71   2.30   2.08   1.94   1.85   1.77   1.72   1.67   1.63   1.60   1.55   1.49   1.42   1.34

Source: [94] Abridged from Pearson, E. and Hartley, H. (1966), Biometrika Tables for 
Statisticians, Vol. 1, 3rd ed., Cambridge University Press, Cambridge. 


Table A.4b  F-Distribution - Critical values of upper 5% points 


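$$P(X \le f) = \int_0^f \frac{\Gamma[(\nu_1+\nu_2)/2]\,(\nu_1/\nu_2)^{\nu_1/2}\,u^{\nu_1/2-1}}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)\,(1+\nu_1 u/\nu_2)^{(\nu_1+\nu_2)/2}}\,du,$$

$$P(F > f_{\nu_1,\nu_2;1-\alpha}) = \alpha = .05$$

Note: $F = s_1^2/s_2^2 = [S_1/\nu_1]/[S_2/\nu_2]$, where $s_1^2$ and $s_2^2$ are independent mean squares estimating a
common variance $\sigma^2$ and based on $\nu_1$ and $\nu_2$ degrees of freedom, respectively. 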
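
The tabulated points can be checked by integrating the F density numerically. The sketch below is one way to do that with composite Simpson's rule, under two assumptions: the density is the standard F density with $\nu_1$ and $\nu_2$ degrees of freedom, and $\nu_1 > 2$ so that the integrand is zero at the origin. The names `f_density` and `f_cdf` are illustrative and do not come from the text.

```python
import math


def f_density(u, v1, v2):
    """Density of the F-distribution with (v1, v2) degrees of freedom at u > 0."""
    log_const = (math.lgamma((v1 + v2) / 2) - math.lgamma(v1 / 2)
                 - math.lgamma(v2 / 2) + (v1 / 2) * math.log(v1 / v2))
    log_kernel = ((v1 / 2 - 1) * math.log(u)
                  - ((v1 + v2) / 2) * math.log1p(v1 * u / v2))
    return math.exp(log_const + log_kernel)


def f_cdf(f, v1, v2, n=2000):
    """Approximate P(X <= f) by composite Simpson's rule on [0, f].

    Assumes v1 > 2, so the density vanishes at u = 0 and that endpoint
    contributes nothing to the sum. n must be even.
    """
    if v1 <= 2:
        raise ValueError("this sketch assumes v1 > 2")
    if n % 2:
        raise ValueError("n must be even for Simpson's rule")
    h = f / n
    total = f_density(f, v1, v2)  # right endpoint; left endpoint is 0
    for i in range(1, n):
        weight = 4 if i % 2 else 2
        total += weight * f_density(i * h, v1, v2)
    return total * h / 3


# Upper 5% point tabulated for v1 = 4, v2 = 10 is 3.48.
upper_tail = 1.0 - f_cdf(3.48, 4, 10)
print(f"P(F > 3.48) is approximately {upper_tail:.4f}")
```

The computed upper-tail probability should come out close to .05, up to the two-decimal rounding of the tabulated value.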


d.f.                                ν1 = Degrees of Freedom for Numerator
  ν2      1      2      3      4      5      6      7      8      9     10     12     15     20     30
   1  161.4    200    216    225    230    234    237    239    241    242    244    246    248    250
   2  18.51   19.0   19.2   19.2   19.3   19.3   19.4   19.4   19.4   19.4  19.41   19.4   19.4   19.5
   3  10.13   9.55   9.28   9.12   9.01   8.94   8.89   8.85   8.81   8.79   8.74   8.70   8.66   8.62
   4   7.71   6.94   6.59   6.39   6.26   6.16   6.09   6.04   6.00   5.96   5.91   5.86   5.80   5.75
   5   6.61   5.79   5.41   5.19   5.05   4.95   4.88   4.82   4.77   4.74   4.68   4.62   4.56   4.50
   6   5.99   5.14   4.76   4.53   4.39   4.28   4.21   4.15   4.10   4.06   4.00   3.94   3.87   3.81
   7   5.59   4.74   4.35   4.12   3.97   3.87   3.79   3.73   3.68   3.64   3.57   3.51   3.44   3.38
   8   5.32   4.46   4.07   3.84   3.69   3.58   3.50   3.44   3.39   3.35   3.28   3.22   3.15   3.08
   9   5.12   4.26   3.86   3.63   3.48   3.37   3.29   3.23   3.18   3.14   3.07   3.01   2.94   2.86
  10   4.96   4.10   3.71   3.48   3.33   3.22   3.14   3.07   3.02   2.98   2.91   2.85   2.77   2.70
  11   4.84   3.98   3.59   3.36   3.20   3.09   3.01   2.95   2.90   2.85   2.79   2.72   2.65   2.57
  12   4.75   3.89   3.49   3.26   3.11   3.00   2.91   2.85   2.80   2.75   2.69   2.62   2.54   2.47
  13   4.67   3.81   3.41   3.18   3.03   2.92   2.83   2.77   2.71   2.67   2.60   2.53   2.46   2.38
  14   4.60   3.74   3.34   3.11   2.96   2.85   2.76   2.70   2.65   2.60   2.53   2.46   2.39   2.31
  15   4.54   3.68   3.29   3.06   2.90   2.79   2.71   2.64   2.59   2.54   2.48   2.40   2.33   2.25
  16   4.49   3.63   3.24   3.01   2.85   2.74   2.66   2.59   2.54   2.49   2.42   2.35   2.28   2.19
  17   4.45   3.59   3.20   2.96   2.81   2.70   2.61   2.55   2.49   2.45   2.38   2.31   2.23   2.15
  18   4.41   3.55   3.16   2.93   2.77   2.66   2.58   2.51   2.46   2.41   2.34   2.27   2.19   2.11
  19   4.38   3.52   3.13   2.90   2.74   2.63   2.54   2.48   2.42   2.38   2.31   2.23   2.16   2.07
  20   4.35   3.49   3.10   2.87   2.71   2.60   2.51   2.45   2.39   2.35   2.28   2.20   2.12   2.04
  21   4.32   3.47   3.07   2.84   2.68   2.57   2.49   2.42   2.37   2.32   2.25   2.18   2.10   2.01
  22   4.30   3.44   3.05   2.82   2.66   2.55   2.46   2.40   2.34   2.30   2.23   2.15   2.07   1.98
  23   4.28   3.42   3.03   2.80   2.64   2.53   2.44   2.37   2.32   2.27   2.20   2.13   2.05   1.96
  24   4.26   3.40   3.01   2.78   2.62   2.51   2.42   2.36   2.30   2.25   2.18   2.11   2.03   1.94
  25   4.24   3.39   2.99   2.76   2.60   2.49   2.40   2.34   2.28   2.24   2.16   2.09   2.01   1.92
  26   4.23   3.37   2.98   2.74   2.59   2.47   2.39   2.32   2.27   2.22   2.15   2.07   1.99   1.90
  27   4.21   3.35   2.96   2.73   2.57   2.46   2.37   2.31   2.25   2.20   2.13   2.06   1.97   1.88
  28   4.20   3.34   2.95   2.71   2.56   2.45   2.36   2.29   2.24   2.19   2.12   2.04   1.96   1.87
  29   4.18   3.33   2.93   2.70   2.55   2.43   2.35   2.28   2.22   2.18   2.10   2.03   1.94   1.85
  30   4.17   3.32   2.92   2.69   2.53   2.42   2.33   2.27   2.21   2.16   2.09   2.01   1.93   1.84
  40   4.08   3.23   2.84   2.61   2.45   2.34   2.25   2.18   2.12   2.08   2.00   1.92   1.84   1.74
  60   4.00   3.15   2.76   2.53   2.37   2.25   2.17   2.10   2.04   1.99   1.92   1.84   1.75   1.65
 120   3.92   3.07   2.68   2.45   2.29   2.17   2.09   2.02   1.96   1.91   1.83   1.75   1.66   1.55
   ∞   3.84   3.00   2.60   2.37   2.21   2.10   2.01   1.94   1.88   1.83   1.75   1.67   1.57   1.46

Source: [94] Abridged from Pearson, E. and Hartley, H. (1966), Biometrika Tables for 
Statisticians, Vol. 1, 3rd ed., Cambridge University Press, Cambridge. 


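Table A.4c  F-Distribution - Critical values of upper 1% points 


$$P(X \le f) = \int_0^f \frac{\Gamma[(\nu_1+\nu_2)/2]\,(\nu_1/\nu_2)^{\nu_1/2}\,u^{\nu_1/2-1}}{\Gamma(\nu_1/2)\,\Gamma(\nu_2/2)\,(1+\nu_1 u/\nu_2)^{(\nu_1+\nu_2)/2}}\,du,$$

$$P(F > f_{\nu_1,\nu_2;1-\alpha}) = \alpha = .01$$

Note: $F = s_1^2/s_2^2 = [S_1/\nu_1]/[S_2/\nu_2]$, where $s_1^2$ and $s_2^2$ are independent mean squares estimating a
common variance $\sigma^2$ and based on $\nu_1$ and $\nu_2$ degrees of freedom, respectively. 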


d.f.                                ν1 = Degrees of Freedom for Numerator
  ν2      1      2      3      4      5      6      7      8      9     10     12     15     20     30
   1   4052   5000   5403   5625   5764   5859   5928   5981   6022   6056   6106   6157   6209   6261
   2  98.50   99.0   99.2   99.3   99.3   99.3   99.4   99.4   99.4   99.4  99.42   99.4   99.5   99.5
   3  34.12   30.8   29.5   28.7   28.2   27.9   27.7   27.5   27.4   27.2  27.05   26.9   26.7   26.5
   4  21.20   18.0   16.7   16.0   15.5   15.2   15.0   14.8   14.7   14.6  14.37   14.2   14.0   13.8
   5  16.26   13.3   12.1   11.4   11.0   10.7   10.5   10.3   10.2   10.1   9.89   9.72   9.55   9.38
   6  13.75   10.9   9.78   9.15   8.75   8.47   8.26   8.10   7.98   7.87   7.72   7.56   7.40   7.23
   7  12.25   9.55   8.45   7.85   7.46   7.19   6.99   6.84   6.72   6.62   6.47   6.31   6.16   5.99
   8  11.26   8.65   7.59   7.01   6.63   6.37   6.18   6.03   5.91   5.81   5.67   5.52   5.36   5.20
   9  10.56   8.02   6.99   6.42   6.06   5.80   5.61   5.47   5.35   5.26   5.11   4.96   4.81   4.65
  10  10.04   7.56   6.55   5.99   5.64   5.39   5.20   5.06   4.94   4.85   4.71   4.56   4.41   4.25
  11   9.65   7.21   6.22   5.67   5.32   5.07   4.89   4.74   4.63   4.54   4.40   4.25   4.10   3.94
  12   9.33   6.93   5.95   5.41   5.06   4.82   4.64   4.50   4.39   4.30   4.16   4.01   3.86   3.70
  13   9.07   6.70   5.74   5.21   4.86   4.62   4.44   4.30   4.19   4.10   3.96   3.82   3.66   3.51
  14   8.86   6.51   5.56   5.04   4.69   4.46   4.28   4.14   4.03   3.94   3.80   3.66   3.51   3.35
  15   8.68   6.36   5.42   4.89   4.56   4.32   4.14   4.00   3.89   3.80   3.67   3.52   3.37   3.21
  16   8.53   6.23   5.29   4.77   4.44   4.20   4.03   3.89   3.78   3.69   3.55   3.41   3.26   3.10
  17   8.40   6.11   5.18   4.67   4.34   4.10   3.93   3.79   3.68   3.59   3.46   3.31   3.16   3.00
  18   8.29   6.01   5.09   4.58   4.25   4.01   3.84   3.71   3.60   3.51   3.37   3.23   3.08   2.92
  19   8.18   5.93   5.01   4.50   4.17   3.94   3.77   3.63   3.52   3.43   3.30   3.15   3.00   2.84
  20   8.10   5.85   4.94   4.43   4.10   3.87   3.70   3.56   3.46   3.37   3.23   3.09   2.94   2.78
  21   8.02   5.78   4.87   4.37   4.04   3.81   3.64   3.51   3.40   3.31   3.17   3.03   2.88   2.72
  22   7.95   5.72   4.82   4.31   3.99   3.76   3.59   3.45   3.35   3.26   3.12   2.98   2.83   2.67
  23   7.88   5.66   4.76   4.26   3.94   3.71   3.54   3.41   3.30   3.21   3.07   2.93   2.78   2.62
  24   7.82   5.61   4.72   4.22   3.90   3.67   3.50   3.36   3.26   3.17   3.03   2.89   2.74   2.58
  25   7.77   5.57   4.68   4.18   3.85   3.63   3.46   3.32   3.22   3.13   2.99   2.85   2.70   2.54
  26   7.72   5.53   4.64   4.14   3.82   3.59   3.42   3.29   3.18   3.09   2.96   2.81   2.66   2.50
  27   7.68   5.49   4.60   4.11   3.78   3.56   3.39   3.26   3.15   3.06   2.93   2.78   2.63   2.47
  28   7.64   5.45   4.57   4.07   3.75   3.53   3.36   3.23   3.12   3.03   2.90   2.75   2.60   2.44
  29   7.60   5.42   4.54   4.04   3.73   3.50   3.33   3.20   3.09   3.00   2.87   2.73   2.57   2.41
  30   7.56   5.39   4.51   4.02   3.70   3.47   3.30   3.17   3.07   2.98   2.84   2.70   2.55   2.39
  40   7.31   5.18   4.31   3.83   3.51   3.29   3.12   2.99   2.89   2.80   2.66   2.52   2.37   2.20
  60   7.08   4.98   4.13   3.65   3.34   3.12   2.95   2.82   2.72   2.63   2.50   2.35   2.20   2.03
 120   6.85   4.79   3.95   3.48   3.17   2.96   2.79   2.66   2.56   2.47   2.34   2.19   2.03   1.86
   ∞   6.63   4.61   3.78   3.32   3.02   2.80   2.64   2.51   2.41   2.32   2.18   2.04   1.88   1.70

Source: [94] Abridged from Pearson, E. and Hartley, H. (1966), Biometrika Tables for 
Statisticians, Vol. 1, 3rd ed., Cambridge University Press, Cambridge. 


Table A.5 Critical Values of the Durbin-Watson Statistic 


This table provides the two limiting values of critical d ($d_L$ and $d_U$), corresponding to the two 
most extreme configurations of the regressors for testing autocorrelation. Note that the critical 
values are one-sided. (Significance level $\alpha$ = probability in lower tail.) 

For example, suppose there are n = 20 observations and p = 3 regressors, and we wish to 
test $H_0\colon \rho = 0$ versus $H_1\colon \rho > 0$ at $\alpha$ = .05. If D falls below $d_L$ = 1.00, we reject $H_0$. If 
D is above $d_U$ = 1.68, we cannot reject $H_0$. If D lies between $d_L$ and $d_U$, the test is 
inconclusive. 
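
The decision rule in the example can be sketched in a few lines. The code below assumes the usual definition of the statistic, $D = \sum_{t=2}^{n}(e_t - e_{t-1})^2 / \sum_{t=1}^{n} e_t^2$, computed from the ordered residuals $e_t$. The function names and the value of D used in the call are hypothetical; only the bounds 1.00 and 1.68 come from the example above.

```python
def durbin_watson(residuals):
    """Return D = sum (e_t - e_{t-1})^2 / sum e_t^2 for time-ordered residuals."""
    if len(residuals) < 2:
        raise ValueError("need at least two residuals")
    numerator = sum((residuals[t] - residuals[t - 1]) ** 2
                    for t in range(1, len(residuals)))
    denominator = sum(e * e for e in residuals)
    if denominator == 0:
        raise ValueError("residuals are all zero")
    return numerator / denominator


def dw_decision(d, d_lower, d_upper):
    """One-sided test of H0: rho = 0 against H1: rho > 0."""
    if d < d_lower:
        return "reject H0"
    if d > d_upper:
        return "do not reject H0"
    return "inconclusive"


# Bounds for n = 20, p = 3, alpha = .05 as quoted in the example.
D_example = 0.85  # hypothetical value of the statistic
print(dw_decision(D_example, d_lower=1.00, d_upper=1.68))  # reject H0
```

Values of D exactly equal to either bound are treated here as inconclusive, which is a choice made for the sketch.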


Sample size (rows): 18, 20, 25, 30, 40, 50, 60, 80, 100 
p = Number of Independent Variables (Excluding the Constant) (columns) 


Source: [30] Abridged from Tables I, II and III of Durbin, J. and Watson, G. (1951), Biometrika, 
Vol. 38, pp. 159-177. 

Selection function, 361 
Serial correlation, 112 
Sherman-Morrison-Woodbury formula, 141 
Shrinkage parameter, 400 
Shrunken estimator, 390 
    shrinking factors, 391 
Significance level, 41 
Simple linear regression, 51 
    error model, 52 
Simultaneous confidence intervals, 78 
Singular matrix, 185 
Singular value decomposition, 167, 188, 380 
    singular vectors, 169 
Smoothing parameter, 326 
Spectral theorem, 149, 151, 210 
Spline functions, 317 
Spline models, 313, 323 
Splines, 277 
Standard deviation, 16 
Standard error, 65, 72 
Standard error of prediction, 95 
Standard normal distribution, 19 
Stepwise regression, 369, 373 
Stepwise regression variable selection, 229 
Studentized residual 
    externally, 251 
    internally, 251 
Sufficiency, 35 
    jointly sufficient, 36 
Sufficient statistic, 35 
Sum of squares 
    regression, 84 
    residual, 84 
    total, 84 

t-distribution, 23 
Test 
    generalized likelihood ratio, 43 
    likelihood ratio, 41 
    one-sided hypotheses, 45 
    two-tailed, 42 
    uniformly most powerful, 43 
Time series analysis, 301 
Time series data, 333 
Total least squares, 56 
Total variance, 214 
Trace, 135 
Transformations, 113, 276 
    Atkinson’s modification, 278 
    Box-Cox method, 119, 280, 281 
    Box-Tidwell method, 118, 277, 281 
    in x, 276 
    intrinsically linear, 118 
    linearizable, 280 
    logarithmic, 25 
    of y, 279 
    power family, 120 
    variance equalizing, 288 
    variance stabilizing, 299 
Tukey’s rule, 367 
Two-tailed test, 44 

Unbiased estimator, 23 
Unbiasedness, 33 
Uncorrelatedness, 16 
Uniform measure, 10 
Uniformly most powerful test, 43 

Variable 
    dependent, 51 
    design, 51 
    dummy, 328 
    independent, 51 
    qualitative, 327 
    quantitative, 327 
    response, 51 
Variable plots, 252 
Variable selection problem, 359 
Variance, 16, 17 
Variance decomposition proportion, 387 
Variance equalizing transformations 
    weighted least squares, 288 
Variance inflation factor (VIF), 214 
Variance multiplication factor, 72, 77 
Variance stabilizing transformations, 299 
Variance-covariance matrix, 156 
Vector 
    angle, 145 
    basis, 146 
    canonical base, 146 
    column space spanned, 146 
    dimension, 146 
    dot product, 143 
    inner product, 143 
    length, 144 
    linear combination, 138 
    linearly dependent, 138 
    linearly independent, 138 
    orthogonal, 145 
    orthogonal base, 146 
    orthonormal, 148 
    span, 146 
    subspace, 146 
Vector differentiation, 186, 243 
Vector of observations, 183 
    centered, 191 
Vector of regression coefficients, 183 

Weighted least squares, 288 
    equations, 352 
    estimator, 291 
    iteratively reweighted, 294 
    sum of squares, 290 
    weights, 289 
Weighted mean, 308 

Zero intercept model, 94 
